{"title": "Semi-Supervised Factored Logistic Regression for High-Dimensional Neuroimaging Data", "book": "Advances in Neural Information Processing Systems", "page_first": 3348, "page_last": 3356, "abstract": "Imaging neuroscience links human behavior to aspects of brain biology in ever-increasing datasets. Existing neuroimaging methods typically perform either discovery of unknown neural structure or testing of neural structure associated with mental tasks. However, testing hypotheses on the neural correlates underlying larger sets of mental tasks necessitates adequate representations for the observations. We therefore propose to blend representation modelling and task classification into a unified statistical learning problem. A multinomial logistic regression is introduced that is constrained by factored coefficients and coupled with an autoencoder. We show that this approach yields more accurate and interpretable neural models of psychological tasks in a reference dataset, as well as better generalization to other datasets.", "full_text": "Semi-Supervised Factored Logistic Regression for\n\nHigh-Dimensional Neuroimaging Data\n\nDanilo Bzdok, Michael Eickenberg, Olivier Grisel, Bertrand Thirion, Ga\u00a8el Varoquaux\n\n\ufb01rstname.lastname@inria.fr\n\nINRIA, Parietal team, Saclay, France\n\nCEA, Neurospin, Gif-sur-Yvette, France\n\nAbstract\n\nImaging neuroscience links human behavior to aspects of brain biology in ever-\nincreasing datasets. Existing neuroimaging methods typically perform either dis-\ncovery of unknown neural structure or testing of neural structure associated with\nmental tasks. However, testing hypotheses on the neural correlates underlying\nlarger sets of mental tasks necessitates adequate representations for the observa-\ntions. We therefore propose to blend representation modelling and task classi\ufb01ca-\ntion into a uni\ufb01ed statistical learning problem. 
A multinomial logistic regression is introduced that is constrained by factored coefficients and coupled with an autoencoder. We show that this approach yields more accurate and interpretable neural models of psychological tasks in a reference dataset, as well as better generalization to other datasets.\n\nkeywords: Brain Imaging, Cognitive Science, Semi-Supervised Learning, Systems Biology\n\n1 Introduction\n\nMethods for neuroimaging research can be grouped into those discovering neurobiological structure and those assessing the neural correlates associated with mental tasks. To discover, on the one hand, spatial distributions of neural activity structure across time, independent component analysis (ICA) is often used [6]. It decomposes the BOLD (blood-oxygen level-dependent) signals into their primary modes of variation. The ensuing spatial activity patterns are believed to represent brain networks of functionally interacting regions [26]. Similarly, sparse principal component analysis (SPCA) has been used to separate BOLD signals into parsimonious network components [28]. The extracted brain networks are probably manifestations of electrophysiological oscillation frequencies [17]. Their fundamental organizational role is further attested by continued covariation during sleep and anesthesia [10]. Network discovery by applying ICA or SPCA is typically performed on task-unrelated (i.e., unlabeled) “resting-state” data. These capture brain dynamics during ongoing random thought without controlled environmental stimulation. In fact, a large portion of the BOLD signal variation is known not to correlate with a particular behavior, stimulus, or experimental task [10].\n\nTo test, on the other hand, the neural correlates underlying mental tasks, the general linear model (GLM) is the dominant approach [13]. The contribution of individual brain voxels is estimated according to a design matrix of experimental tasks.
Alternatively, psychophysiological interactions (PPI) elucidate the influence of one brain region on another conditioned by experimental tasks [12]. As a last example, an increasing number of neuroimaging studies model experimental tasks by training classification algorithms on brain signals [23]. All these methods are applied to task-associated (i.e., labeled) data that capture brain dynamics during stimulus-guided behavior. Two important conclusions can be drawn. First, the mentioned supervised neuroimaging analyses typically yield results in a voxel space. This ignores the fact that the BOLD signal exhibits spatially distributed patterns of coherent neural activity. Second, existing supervised neuroimaging analyses cannot exploit the abundance of easily acquired resting-state data [8]. These may allow better discovery of the manifold of brain states due to the high task-rest similarities of neural activity patterns, as observed using ICA [26] and linear correlation [9].\n\nBoth these neurobiological properties can be conjointly exploited in an approach that is mixed (i.e., using rest and task data), factored (i.e., performing network decomposition), and multi-task (i.e., capitalizing on neural representations shared across mental operations). The integration of brain-network discovery into supervised classification can yield a semi-supervised learning framework. The most relevant neurobiological structure should hence be identified for the prediction problem at hand. Autoencoders suggest themselves because they can emulate variants of most unsupervised learning algorithms, including PCA, SPCA, and ICA [15, 16]. Autoencoders (AE) are layered learning models that condense the input data to local and global representations via reconstruction under a compression prior. They behave like a (truncated) PCA in the case of one linear hidden layer and a squared error loss [3].
Autoencoders behave like an SPCA if shrinkage terms are added to the model weights in the optimization objective. Moreover, they have the characteristics of an ICA in the case of tied weights and a nonlinear convex function added at the first layer [18]. These authors further demonstrated that ICA, sparse autoencoders, and sparse coding are mathematically equivalent under mild conditions. Thus, autoencoders may flexibly project the neuroimaging data onto the main directions of variation.\n\nIn the present investigation, a linear autoencoder will be fit to (unlabeled) rest data and integrated as a rank-reducing bottleneck into a multinomial logistic regression fit to (labeled) task data. We can then solve the compound statistical problem of unsupervised data representation and supervised classification, previously studied in isolation. From the perspective of dictionary learning, the first layer represents projectors onto the discovered set of basis functions, which are linearly combined by the second layer to perform predictions [20]. Neurobiologically, this allows delineating a low-dimensional manifold of brain-network patterns and then distinguishing mental tasks by their most discriminative linear combinations. Theoretically, a reduction in model variance should be achieved by resting-state autoencoders that privilege the most neurobiologically valid models in the hypothesis set. Practically, neuroimaging research frequently suffers from data scarcity. This limits the set of representations that can be extracted from GLM analyses based on few participants.
We therefore contribute a computational framework that 1) analyzes many problems simultaneously (thus finds shared representations by “multi-task learning”) and 2) exploits unlabeled data (since they span a space of meaningful configurations).\n\nFigure 1: Model architecture. Linear autoencoders find an optimized compression of 79,941 brain voxels into n unknown activity patterns by improving reconstruction from them. The decomposition matrix equates with the bottleneck of a factored logistic regression. Supervised multi-class learning on task data (Xtask) can thus be guided by unsupervised decomposition of rest data (Xrest).\n\n2 Methods\n\nData. As the currently biggest openly accessible reference dataset, we chose resources from the Human Connectome Project (HCP) [4]. Neuroimaging task data with labels of ongoing cognitive processes were drawn from 500 healthy HCP participants (cf. Appendix for details on the datasets). 18 HCP tasks were selected that are known to elicit reliable neural activity across participants (Table 1). In sum, the HCP task data incorporated 8650 first-level activity maps from 18 diverse paradigms administered to 498 participants (2 removed due to incomplete data). All maps were resampled to a common 60 × 72 × 60 space of 3 mm isotropic voxels and gray-matter masked (at least 10% tissue probability).
The supervised analyses were thus based on labeled HCP task maps with 79,941 voxels of interest representing z-values in gray matter.\n\nCognitive Task | Stimuli | Instruction for participants\n1 Reward | Card game | Guess the number of a mystery card for gain/loss of money\n2 Punish | Card game | Guess the number of a mystery card for gain/loss of money\n3 Shapes | Shape pictures | Decide which of two shapes matches another shape geometrically\n4 Faces | Face pictures | Decide which of two faces matches another face emotionally\n5 Random | Videos with objects | Decide whether the objects act randomly or intentionally\n6 Theory of mind | Videos with objects | Decide whether the objects act randomly or intentionally\n7 Mathematics | Spoken numbers | Complete addition and subtraction problems\n8 Language | Auditory stories | Choose answer about the topic of the story\n9 Tongue movement | Visual cues | Move tongue\n10 Foot movement | Visual cues | Squeezing of the left or right toe\n11 Hand movement | Visual cues | Tapping of the left or right finger\n12 Matching | Shapes with textures | Decide whether two objects match in shape or texture\n13 Relations | Shapes with textures | Decide whether object pairs differ both along either shape or texture\n14 View Bodies | Pictures | Passive watching\n15 View Faces | Pictures | Passive watching\n16 View Places | Pictures | Passive watching\n17 View Tools | Pictures | Passive watching\n18 Two-Back | Various pictures | Indicate whether current stimulus is the same as two items earlier\n\nTable 1: Description of psychological tasks to predict.\n\nThese labeled data were complemented by unlabeled activity maps from HCP acquisitions of unconstrained resting-state activity [25]. These reflect brain activity in the absence of controlled thought. In sum, the HCP rest data concatenated 8000 unlabeled, noise-cleaned rest maps, with 40 brain maps from each of 200 randomly selected participants.\n\nWe were further interested in the utility of the optimized low-rank projection in one task dataset for dimensionality reduction in another task dataset.
To this end, the HCP-derived network decompositions were used as a preliminary step in the classification problem of another large sample. The ARCHI dataset [21] provides activity maps from diverse experimental tasks, including auditory and visual perception, motor action, reading, language comprehension, and mental calculation. Analogous to the HCP data, the second task dataset thus incorporated 1404 labeled, grey-matter masked, and z-scored activity maps from 18 diverse tasks acquired in 78 participants.\n\nLinear autoencoder. The labeled and unlabeled data were fed into a linear statistical model composed of an autoencoder and a dimensionality-reducing logistic regression. The affine autoencoder takes the input x, projects it into a coordinate system of latent representations z, and reconstructs it back to x' by\n\nz = W0 x + b0,   x' = W1 z + b1,   (1)\n\nwhere x ∈ ℝ^d denotes the vector of d = 79,941 voxel values from each rest map, z ∈ ℝ^n is the n-dimensional hidden state (i.e., distributed neural activity patterns), and x' ∈ ℝ^d is the reconstruction vector of the original activity map from the hidden variables. Further, W0 denotes the weight matrix that transforms from input space into the hidden space (encoder), and W1 is the weight matrix for back-projection from the hidden variables to the output space (decoder). b0 and b1 are the corresponding bias vectors. The model parameters W0, b0, b1 are found by minimizing the expected squared reconstruction error\n\nE[L_AE(x)] = E[ ‖x − (W1(W0 x + b0) + b1)‖² ].   (2)\n\nHere we choose W0 and W1 to be tied, i.e., W0 = W1ᵀ. Consequently, the learned weights are forced to take on a two-fold function: that of signal analysis and that of signal synthesis. The first layer analyzes the data to obtain the cleanest latent representation, while the second layer represents building blocks from which to synthesize the data using the latent activations.
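As a concrete illustration, the tied-weight reconstruction mapping of Eqs. (1) and (2) can be sketched in a few lines of NumPy. This is a toy re-implementation, not the paper's Theano code; the sizes `d` and `n` and the random stand-in data are made up (the paper uses d = 79,941 voxels), and the weights here are only random initializations, so the loss merely demonstrates the forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 200, 5                       # toy sizes; the paper uses d = 79,941 voxels
X_rest = rng.normal(size=(100, d))  # stand-in for unlabeled rest maps (rows = samples)

W0 = rng.normal(scale=0.004, size=(n, d))  # encoder; the decoder is tied as W0.T
b0, b1 = np.zeros(n), np.zeros(d)

def reconstruct(x):
    """x' = W1 z + b1 with z = W0 x + b0 and W1 = W0.T (tied weights), Eq. (1)."""
    z = W0 @ x + b0
    return W0.T @ z + b1

def ae_loss(X):
    """Empirical squared reconstruction error of Eq. (2), averaged over a batch."""
    R = np.stack([reconstruct(x) for x in X])
    return float(np.mean(np.sum((X - R) ** 2, axis=1)))

loss = ae_loss(X_rest)
```

In the paper this objective is not minimized in isolation but jointly with the classification loss introduced next.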
Tying these processes together makes the analysis layer interpretable and pulls all non-zero singular values towards 1. Nonlinearities were not applied to the activations in the first layer.\n\nFactored logistic regression. Our factored logistic regression model is best described as a variant of a multinomial logistic regression. Specifically, the weight matrix is replaced by the product of two weight matrices with a common latent dimension. The latter is typically much lower than the dimension of the data. Alternatively, this model can be viewed as a single-hidden-layer feed-forward neural network with a linear activation function for the hidden layer and a softmax function on the output layer. As the dimension of the hidden layer is much lower than that of the input layer, this architecture is sometimes referred to as a “linear bottleneck” in the literature. The probability of an input x belonging to a class i ∈ {1, . . . , l} is given by\n\nP(Y = i | x; V0, V1, c0, c1) = softmax_i(f_LR(x)),   (3)\n\nwhere f_LR(x) = V1(V0 x + c0) + c1 computes multinomial logits and softmax_i(x) = exp(x_i) / Σ_j exp(x_j). The matrix V0 ∈ ℝ^{d×n} transforms the input x ∈ ℝ^d into n latent components, and the matrix V1 ∈ ℝ^{n×l} projects the latent components onto hyperplanes that reflect the l label probabilities. c0 and c1 are bias vectors. The loss function is given by\n\nE[L_LR(x, y)] ≈ −(1 / N_Xtask) Σ_{k=1}^{N_Xtask} log P(Y = y(k) | x(k); V0, V1, c0, c1).   (4)\n\nLayer combination. The optimization problems of the linear autoencoder and the factored logistic regression are linked in two ways. First, their transformation matrices mapping from input to the latent space are tied:\n\nV0 = W0.   (5)\n\nWe hence search for a compression of the 79,941 voxel values into n unknown components that represent a latent code optimized for both rest and task activity data.
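The factored classification path of Eqs. (3) and (4) reduces, in code, to a low-rank logit computation followed by a softmax. A minimal NumPy sketch with toy, made-up dimensions and random stand-in data (the encoder `V0` is the matrix that gets tied to the autoencoder's W0):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, l = 200, 5, 18               # toy voxel, component, and class counts

V0 = rng.normal(scale=0.004, size=(n, d))  # bottleneck, tied to the autoencoder encoder
V1 = rng.normal(scale=0.004, size=(l, n))  # maps n components to l class logits
c0, c1 = np.zeros(n), np.zeros(l)

def softmax(a):
    a = a - a.max()                # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

def predict_proba(x):
    """P(Y = i | x) = softmax_i(V1 (V0 x + c0) + c1), Eq. (3)."""
    return softmax(V1 @ (V0 @ x + c0) + c1)

def nll(X, y):
    """Mean negative log-likelihood over labeled task maps, Eq. (4)."""
    return -float(np.mean([np.log(predict_proba(x)[t]) for x, t in zip(X, y)]))

X_task = rng.normal(size=(10, d))
y_task = rng.integers(0, l, size=10)
loss = nll(X_task, y_task)
```

Because the rank-n product V1 V0 replaces a full l × d weight matrix, the class hyperplanes are forced to share the same n latent activity patterns.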
Second, the objectives of the autoencoder and the factored logistic regression are interpolated in the common loss function\n\nL(θ, λ) = λ L_LR + (1 − λ) (1 / N_Xrest) L_AE + Ω.   (6)\n\nIn so doing, with the shared bottleneck V0 = W0, we search for the combined model parameters θ = {V0, V1, c0, c1, b0, b1} with respect to both the (unsupervised) reconstruction error and the (supervised) task detection. L_AE is divided by N_Xrest to equilibrate both loss terms to the same order of magnitude. Ω represents an ElasticNet-type regularization that combines ℓ1 and ℓ2 penalty terms.\n\nOptimization. The common objective was optimized by gradient descent in the SSFLogReg parameters. The required gradients were obtained by using the chain rule to backpropagate error derivatives. We chose the rmsprop solver [27], a refinement of stochastic gradient descent. Rmsprop dictates an adaptive learning rate for each model parameter by scaling gradients with a running average. The batch size was set to 100 (given much expected redundancy in Xrest and Xtask), matrix parameters were initialized by Gaussian random values multiplied by 0.004 (i.e., gain), and bias parameters were initialized to 0.\n\nThe normalization factor and the update rule for θ are given by\n\nv(t+1) = ρ v(t) + (1 − ρ) (∇θ f(x(t), y(t), θ(t)))²\n\nθ(t+1) = θ(t) − α ∇θ f(x(t), y(t), θ(t)) / (√v(t+1) + ε),   (7)\n\nwhere f is the loss function computed on a minibatch sample at timestep t, α is the learning rate (0.00001), ε a global damping factor (10⁻⁶), and ρ the decay rate (0.9, to deemphasize the magnitude of the gradient). Note that we also experimented with other solvers (stochastic gradient descent, adadelta, and adagrad) but found that rmsprop converged faster and with similar or higher generalization performance.\n\nImplementation. The analyses were performed in Python.
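The rmsprop update of Eq. (7) is compact enough to state directly. Below is a self-contained sketch (not the paper's Theano implementation) applied to a made-up toy quadratic objective, using the hyperparameter values stated above:

```python
import numpy as np

def rmsprop_step(theta, grad, v, alpha=1e-5, rho=0.9, eps=1e-6):
    """One rmsprop update, Eq. (7): the gradient is rescaled by a running
    average of its recent squared magnitude before the descent step."""
    v = rho * v + (1 - rho) * grad ** 2
    theta = theta - alpha * grad / (np.sqrt(v) + eps)
    return theta, v

# toy objective f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself
theta, v = np.ones(3), np.zeros(3)
for _ in range(1000):
    theta, v = rmsprop_step(theta, theta, v)
# after the loop, theta has moved a small distance towards the minimum at 0
```

With the paper's small learning rate the per-step movement is tiny, which is why the running average v, rather than the raw gradient magnitude, governs the effective step size.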
We used nilearn to handle the large quantities of neuroimaging data [1] and Theano for automatic, numerically stable differentiation of symbolic computation graphs [5, 7]. All Python scripts that generated the results are accessible online for reproducibility and reuse (http://github.com/banilo/nips2015).\n\n3 Experimental Results\n\nSerial versus parallel structure discovery and classification. We first tested whether there is a substantial advantage in combining unsupervised decomposition and supervised classification learning. We benchmarked our approach against performing data reduction on the (unlabeled) first half of the HCP task data by PCA, SPCA, ICA, and AE (n = 5, 20, 50, 100 components) and learning classification models on the (labeled) second half by ordinary logistic regression. PCA reduced the dimensionality of the task data by finding orthogonal network components (whitening of the data). SPCA separated the task-related BOLD signals into network components with few regions by a regression-type optimization problem constrained by an ℓ1 penalty (no orthogonality assumptions, 1000 maximum iterations, per-iteration tolerance of 10⁻⁸, α = 1). ICA performed iterative blind source separation by a parallel FASTICA implementation (200 maximum iterations, per-iteration tolerance of 0.0001, initialized by a random mixing matrix, whitening of the data). AE found a code of latent representations by optimizing projection into a bottleneck (500 iterations, same implementation as below for rest data). The second half of the task data was projected onto the latent components discovered in its first half. Only the ensuing component loadings were submitted to ordinary logistic regression (no hidden layer, ℓ1 = 0.1, ℓ2 = 0.1, 500 iterations).
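The serial two-step baseline can be sketched with scikit-learn on synthetic stand-in arrays (hypothetical data, not the HCP maps; the regularization strengths of the actual benchmark are omitted for brevity):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 50, 5                          # toy voxel and component counts
X_half1 = rng.normal(size=(200, d))   # "unlabeled" half, used only for decomposition
X_half2 = rng.normal(size=(200, d))   # "labeled" half, used for classification
y = rng.integers(0, 3, size=200)
X_half2[np.arange(200), y] += 3.0     # inject a simple class-dependent signal

# step 1: unsupervised decomposition on the first half
pca = PCA(n_components=n).fit(X_half1)
# step 2: ordinary logistic regression on the component loadings of the second half
clf = LogisticRegression(max_iter=500).fit(pca.transform(X_half2), y)
accuracy = clf.score(pca.transform(X_half2), y)
```

Because the decomposition never sees the labels, class-relevant directions can be discarded in step 1: the failure mode that motivates learning both steps jointly in SSFLogReg.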
These serial two-step approaches were compared against parallel decomposition and classification by SSFLogReg (one hidden layer, λ = 1, ℓ1 = 0.1, ℓ2 = 0.1, 500 iterations). Importantly, all trained classification models were tested on a large, unseen test set (20% of the data) in the present analyses. Across choices for n, SSFLogReg achieved more than 95% out-of-sample accuracy, whereas supervised learning based on PCA, SPCA, ICA, and AE loadings ranged from 32% to 87% (Table 2). This experiment establishes the advantage of directly searching for classification-relevant structure in the fMRI data, rather than solving the supervised and unsupervised problems independently. This effect was particularly pronounced when assuming few hidden dimensions.\n\nn | PCA + LogReg | SPCA + LogReg | ICA + LogReg | AE + LogReg | SSFLogReg\n5 | 45.1% | 32.2% | 37.5% | 44.2% | 95.7%\n20 | 78.1% | 78.2% | 81.0% | 63.2% | 97.3%\n50 | 81.7% | 84.0% | 84.2% | 77.0% | 97.6%\n100 | 81.3% | 82.2% | 87.3% | 76.6% | 97.4%\n\nTable 2: Serial versus parallel dimensionality reduction and classification. Chance is at 5.6%.\n\nModel performance. SSFLogReg was subsequently trained (500 epochs) across parameter choices for the hidden components (n = 5, 20, 100) and the balance between autoencoder and logistic regression (λ = 0, 0.25, 0.5, 0.75, 1). Assuming 5 latent directions of variation should yield models with higher bias and smaller variance than SSFLogReg with 100 latent directions. Given the 18-class problem of the HCP data, setting λ to 0 consistently yields generalization performance at chance level (5.6%) because only the unsupervised layer of the estimator is optimized. At each epoch (i.e., iteration over the data), the out-of-sample performance of the trained classifier was assessed on 20% of unseen HCP data.
Additionally, the “out-of-study” performance of the learned decomposition (W0) was assessed by using it as a dimensionality reduction of an independent labeled dataset (i.e., ARCHI) and conducting ordinary logistic regression on the ensuing component loadings.\n\nn = 5 (λ = 0 / 0.25 / 0.5 / 0.75 / 1)\nOut-of-sample accuracy: 6.0% / 88.9% / 95.1% / 96.5% / 95.7%\nPrecision (mean): 5.9% / 87.0% / 94.9% / 96.3% / 95.4%\nRecall (mean): 5.6% / 88.3% / 95.2% / 96.6% / 95.7%\nF1 score (mean): 4.1% / 86.6% / 94.9% / 96.4% / 95.4%\nReconstr. error (norm.): 0.76 / 1.08 / 1.22 / 0.85 / 1.79\nOut-of-study accuracy: 39.4% / 60.8% / 54.3% / 60.7% / 62.9%\n\nn = 20 (λ = 0 / 0.25 / 0.5 / 0.75 / 1)\nOut-of-sample accuracy: 5.5% / 97.4% / 97.8% / 97.3% / 97.3%\nPrecision (mean): 5.1% / 97.4% / 97.1% / 97.0% / 97.0%\nRecall (mean): 4.6% / 97.5% / 97.5% / 97.4% / 97.4%\nF1 score (mean): 3.8% / 97.4% / 97.2% / 97.1% / 97.1%\nReconstr. error (norm.): 0.64 / 0.60 / 0.65 / 0.87 / 1.01\nOut-of-study accuracy: 77.0% / 79.7% / 81.9% / 79.7% / 79.4%\n\nn = 100 (λ = 0 / 0.25 / 0.5 / 0.75 / 1)\nOut-of-sample accuracy: 6.1% / 97.2% / 97.0% / 97.8% / 97.4%\nPrecision (mean): 5.9% / 96.9% / 96.5% / 97.5% / 96.9%\nRecall (mean): 7.2% / 97.2% / 97.2% / 97.9% / 97.4%\nF1 score (mean): 5.3% / 97.0% / 96.7% / 97.7% / 97.2%\nReconstr. error (norm.): 0.67 / 0.69 / 0.68 / 0.73 / 0.77\nOut-of-study accuracy: 79.2% / 82.2% / 81.7% / 81.3% / 75.8%\n\nTable 3: Performance of SSFLogReg across model parameter choices. Chance is at 5.6%.\n\nWe made three noteworthy observations (Table 3). First, the most supervised estimator (λ = 1) achieved in no instance the best accuracy, precision, recall, or F1 scores on HCP data. Classification by SSFLogReg is therefore facilitated by imposing structure from the unlabeled rest data. As confirmed by the normalized reconstruction error (E = ‖x − x̂‖ / ‖x‖), little weight on the supervised term suffices for good model performance while keeping E low and the task-map decomposition rest-like.\n\nFigure 2: Effect of bottleneck in a 38-task classification problem. Depicts the F1 prediction scores for each of 38 psychological tasks.
Multinomial logistic regression operating in voxel space (blue bars) was compared to SSFLogReg operating in 20 (left plot) and 100 (right plot) latent modes (grey bars). Autoencoder and rest data were not used for these analyses (λ = 1). Ordinary logistic regression yielded 77.7% accuracy out of sample, while SSFLogReg scored at 94.4% (n = 20) and 94.2% (n = 100). Hence, compressing the voxel data into a component space for classification achieves higher task separability. Chance is at 2.6%.\n\nSecond, the higher the number of latent components n, the higher the out-of-study performance at small values of λ. This suggests that the presence of more rest-data-inspired hidden components results in more effective feature representations in unrelated task data. Third, for n = 20 and 100 (but not 5), the purely rest-data-trained decomposition matrix (λ = 0) resulted in noninferior out-of-study performance of 77.0% and 79.2%, respectively (Table 3). This confirms that guiding model learning by task-unrelated structure extracts features of general relevance beyond the supervised problem at hand.\n\nIndividual effects of dimensionality reduction and rest data. We first quantified the impact of introducing a bottleneck layer, disregarding the autoencoder. To this end, ordinary logistic regression was juxtaposed with SSFLogReg at λ = 1. For this experiment, we increased the difficulty of the classification problem by including data from all 38 HCP tasks. Indeed, increased class separability in component space, as compared to voxel space, entails differences in generalization performance of ≈ 17% (Figure 2). Notably, the cognitive tasks on reward and punishment processing are among the worst predicted with ordinary but well predicted with low-rank logistic regression (tasks 1 and 2 in Figure 2). These experimental conditions have been reported to exhibit highly similar neural activity patterns in GLM analyses of that dataset [4].
Consequently, even local activity differences (in the striatum and visual cortex in this case) can be successfully captured by brain-network modelling.\n\nWe then contemplated the impact of rest structure (Figure 3) by modulating its influence (λ = 0.25, 0.5, 0.75) in data-scarce and data-rich settings (n = 20, ℓ1 = 0.1, ℓ2 = 0.1). At the beginning of every epoch, 2000 task and 2000 rest maps were drawn with replacement from the same amounts of task and rest maps. In data-scarce scenarios (frequently encountered by neuroimaging practitioners), the out-of-sample scores improve as we depart from the most supervised model (λ = 1). In data-rich scenarios, the same trend was apparent.\n\nFigure 3: Effect of rest structure. Model performance of SSFLogReg (n = 20, ℓ1 = 0.1, ℓ2 = 0.1) for different choices of λ in data-scarce (100 task and 100 rest maps, hot colors) and data-rich (1000 task and 1000 rest maps, cold colors) scenarios. Gradient descent was performed on 2000 task and 2000 rest maps. At the beginning of each epoch, these were drawn with replacement from a pool of 100 or 1000 different task and rest maps, respectively. Chance is at 5.6%.\n\nFeature identification. We finally examined whether the models were fit for purpose (Figure 4). To this end, we computed Pearson's correlation between the classifier weights and the averaged neural activity map for each of the 18 tasks. Ordinary logistic regression thus yielded a mean correlation of ρ = 0.28 across tasks. For SSFLogReg (λ = 0.25, 0.5, 0.75, 1), a per-class weight map was computed by matrix multiplication of the two inner layers. Feature identification performance thus ranged between ρ = 0.35 and ρ = 0.55 for n = 5, between ρ = 0.59 and ρ = 0.69 for n = 20, and between ρ = 0.58 and ρ = 0.69 for n = 100. Consequently, SSFLogReg puts higher absolute weights on relevant structure. This reflects an increased signal-to-noise ratio, in part explained by the more BOLD-typical local contiguity. Conversely, SSFLogReg puts lower probability mass on irrelevant structure. Although the salt-and-pepper-like weight maps of ordinary logistic regression are less interpretable, they were sufficient for good classification performance. Hence, SSFLogReg yielded class weights that were much more similar to features of the respective training samples for all choices of n and λ. SSFLogReg therefore captures genuine properties of task activity patterns, rather than participant- or study-specific artefacts.\n\nFigure 4: Classification weight maps. The voxel predictors corresponding to 5 exemplary (of 18 total) psychological tasks (rows) from the HCP dataset [4]. Left column: multinomial logistic regression (same implementation but without bottleneck or autoencoder); middle column: SSFLogReg (n = 20 latent components, λ = 0.5, ℓ1 = 0.1, ℓ2 = 0.1); right column: voxel-wise average across all samples of whole-brain activity maps from each task. SSFLogReg a) puts higher absolute weights on relevant structure, b) lower ones on irrelevant structure, and c) yields BOLD-typical local contiguity (without enforcing an explicit spatial prior). All values are z-scored and thresholded at the 75th percentile.\n\nMiscellaneous observations. For the sake of completeness, we informally report modifications of the statistical model that did not improve generalization performance. a) Introducing stochasticity into model learning by input corruption of Xtask deteriorated model performance in all scenarios. Adding b) rectified linear units (ReLU) to W0 or other commonly used nonlinearities (c) sigmoid, d) softplus, e) hyperbolic tangent) all led to decreased classification accuracies, probably due to sample size limits.
Further, f) “pretraining” of the bottleneck W0 (i.e., non-random initialization) by either corresponding PCA, SPCA, or ICA loadings did not exhibit improved accuracies, and neither did g) autoencoder pretraining. Moreover, introducing an additional h) overcomplete layer (100 units) after the bottleneck was not advantageous. Finally, imposing either i) only ℓ1 or j) only ℓ2 penalty terms was disadvantageous in all tested cases. This favored the ElasticNet regularization chosen in the above analyses.\n\n4 Discussion and Conclusion\n\nUsing the flexibility of factored models, we learn the low-dimensional representation from high-dimensional voxel brain space that is most important for the prediction of cognitive task sets. From a machine-learning perspective, the factorization of the logistic regression weights can be viewed as transforming a “multi-class classification problem” into a “multi-task learning problem”. The higher generalization accuracy and support recovery, compared to ordinary logistic regression, hold potential for adoption in various neuroimaging analyses. Besides increased performance, these models are more interpretable by automatically learning a mapping to and from a brain-network space. This domain-specific learning algorithm encourages departure from the artificial and statistically less attractive voxel space. Neurobiologically, brain activity underlying defined mental operations can be explained by linear combinations of the main activity patterns. That is, fMRI data probably concentrate near a low-dimensional manifold of characteristic brain-network combinations. Extracting fundamental building blocks of brain organization might facilitate the quest for the cognitive primitives of human thought.
We hope that these first steps stimulate development towards powerful semi-supervised representation extraction in systems neuroscience.\n\nIn the future, automatic reduction of brain maps to their neurobiological essence may leverage data-intense neuroimaging investigations. Initiatives for data collection are rapidly increasing in neuroscience [22]. These promise structured integration of the neuroscientific knowledge accumulating in databases. Tractability by condensed feature representations can avoid the ill-posed problem of learning the full distribution of activity patterns. This is relevant not only to the multi-class challenges spanning the human cognitive space [24] but also to the multi-modal combination with high-resolution 3D models of brain anatomy [2] and high-throughput genomics [19]. The biggest socioeconomic potential may lie in across-hospital clinical studies that predict disease trajectories and drug responses in psychiatric and neurological populations [11].\n\nAcknowledgment. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 604102 (Human Brain Project). Data were provided by the Human Connectome Project. Further support was received from the German National Academic Foundation (D.B.) and the MetaMRI associated team (B.T., G.V.).\n\nReferences\n\n[1] Abraham, A., Pedregosa, F., Eickenberg, M., Gervais, P., Mueller, A., Kossaifi, J., Gramfort, A., Thirion, B., Varoquaux, G.: Machine learning for neuroimaging with scikit-learn.
Front Neuroinform 8, 14 (2014)\n\n[2] Amunts, K., Lepage, C., Borgeat, L., Mohlberg, H., Dickscheid, T., Rousseau, M.E., Bludau, S., Bazin, P.L., Lewis, L.B., Oros-Peusquens, A.M., et al.: BigBrain: an ultrahigh-resolution 3D human brain model. Science 340(6139), 1472–1475 (2013)\n\n[3] Baldi, P., Hornik, K.: Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks 2(1), 53–58 (1989)\n\n[4] Barch, D.M., Burgess, G.C., Harms, M.P., Petersen, S.E., Schlaggar, B.L., Corbetta, M., Glasser, M.F., Curtiss, S., Dixit, S., Feldt, C.: Function in the human connectome: task-fMRI and individual differences in behavior. NeuroImage 80, 169–189 (2013)\n\n[5] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron, A., Bouchard, N., Warde-Farley, D., Bengio, Y.: Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590 (2012)\n\n[6] Beckmann, C.F., DeLuca, M., Devlin, J.T., Smith, S.M.: Investigations into resting-state connectivity using independent component analysis. Philos Trans R Soc Lond B Biol Sci 360(1457), 1001–13 (2005)\n\n[7] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: a CPU and GPU math expression compiler. Proceedings of the Python for Scientific Computing Conference (SciPy) 4, 3 (2010)\n\n[8] Biswal, B.B., Mennes, M., Zuo, X.N., Gohel, S., Kelly, C., et al.: Toward discovery science of human brain function. Proc Natl Acad Sci U S A 107(10), 4734–9 (2010)\n\n[9] Cole, M.W., Bassett, D.S., Power, J.D., Braver, T.S., Petersen, S.E.: Intrinsic and task-evoked network architectures of the human brain. Neuron 83, 238–251 (2014)\n\n[10] Fox, M.D., Raichle, M.E.: Spontaneous fluctuations in brain activity observed with functional magnetic resonance imaging.
Nat Rev Neurosci 8, 700–711 (2007)

[11] Frackowiak, R., Markram, H.: The future of human cerebral cartography: a novel approach. Philosophical Transactions of the Royal Society of London B: Biological Sciences 370(1668), 20140171 (2015)

[12] Friston, K.J., Buechel, C., Fink, G.R., Morris, J., Rolls, E., Dolan, R.J.: Psychophysiological and modulatory interactions in neuroimaging. Neuroimage 6(3), 218–29 (1997)

[13] Friston, K.J., Holmes, A.P., Worsley, K.J., Poline, J.P., Frith, C.D., Frackowiak, R.S.: Statistical parametric maps in functional imaging: a general linear approach. Hum Brain Mapp 2(4), 189–210 (1994)

[14] Gorgolewski, K., Burns, C.D., Madison, C., Clark, D., Halchenko, Y.O., Waskom, M.L., Ghosh, S.S.: Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in python. Front Neuroinform 5, 13 (2011)

[15] Hertz, J., Krogh, A., Palmer, R.G.: Introduction to the theory of neural computation, vol. 1. Basic Books (1991)

[16] Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)

[17] Hipp, J.F., Siegel, M.: Bold fmri correlation reflects frequency-specific neuronal correlation. Curr Biol (2015)

[18] Le, Q.V., Karpenko, A., Ngiam, J., Ng, A.: Ica with reconstruction cost for efficient overcomplete feature learning. pp. 1017–1025 (2011)

[19] Need, A.C., Goldstein, D.B.: Whole genome association studies in complex diseases: where do we stand? Dialogues in Clinical Neuroscience 12(1), 37 (2010)

[20] Olshausen, B., et al.: Emergence of simple-cell receptive field properties by learning a sparse code for natural images.
Nature 381(6583), 607–609 (1996)

[21] Pinel, P., Thirion, B., Meriaux, S., Jobert, A., Serres, J., Le Bihan, D., Poline, J.B., Dehaene, S.: Fast reproducible identification and large-scale databasing of individual functional cognitive networks. BMC Neurosci 8, 91 (2007)

[22] Poldrack, R.A., Gorgolewski, K.J.: Making big data open: data sharing in neuroimaging. Nature Neuroscience 17(11), 1510–1517 (2014)

[23] Poldrack, R.A., Halchenko, Y.O., Hanson, S.J.: Decoding the large-scale structure of brain function by classifying mental states across individuals. Psychol Sci 20(11), 1364–72 (2009)

[24] Schwartz, Y., Thirion, B., Varoquaux, G.: Mapping cognitive ontologies to and from the brain. Advances in Neural Information Processing Systems (2013)

[25] Smith, S.M., Beckmann, C.F., Andersson, J., Auerbach, E.J., Bijsterbosch, J., Douaud, G., Duff, E., Feinberg, D.A., Griffanti, L., Harms, M.P., et al.: Resting-state fmri in the human connectome project. NeuroImage 80, 144–168 (2013)

[26] Smith, S.M., Fox, P.T., Miller, K.L., Glahn, D.C., Fox, P.M., Mackay, C.E., Filippini, N., Watkins, K.E., Toro, R., Laird, A.R., Beckmann, C.F.: Correspondence of the brain's functional architecture during activation and rest. Proc Natl Acad Sci U S A 106(31), 13040–5 (2009)

[27] Tieleman, T., Hinton, G.: Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning (2012)

[28] Varoquaux, G., Gramfort, A., Pedregosa, F., Michel, V., Thirion, B.: Multi-subject dictionary learning to segment an atlas of brain spontaneous activity. Information Processing in Medical Imaging pp.
562–573 (2011)