{"title": "Learning Sensor Multiplexing Design through Back-propagation", "book": "Advances in Neural Information Processing Systems", "page_first": 3081, "page_last": 3089, "abstract": "Recent progress on many imaging and vision tasks has been driven by the use of deep feed-forward neural networks, which are trained by propagating gradients of a loss defined on the final output, back through the network up to the first layer that operates directly on the image. We propose back-propagating one step further---to learn camera sensor designs jointly with networks that carry out inference on the images they capture. In this paper, we specifically consider the design and inference problems in a typical color camera---where the sensor is able to measure only one color channel at each pixel location, and computational inference is required to reconstruct a full color image. We learn the camera sensor's color multiplexing pattern by encoding it as layer whose learnable weights determine which color channel, from among a fixed set, will be measured at each location. These weights are jointly trained with those of a reconstruction network that operates on the corresponding sensor measurements to produce a full color image. Our network achieves significant improvements in accuracy over the traditional Bayer pattern used in most color cameras. It automatically learns to employ a sparse color measurement approach similar to that of a recent design, and moreover, improves upon that design by learning an optimal layout for these measurements.", "full_text": "Learning Sensor Multiplexing\n\nDesign through Back-propagation\n\nAyan Chakrabarti\n\nToyota Technological Institute at Chicago\n\n6045 S. 
Kenwood Ave., Chicago, IL\n\nayanc@ttic.edu\n\nAbstract\n\nRecent progress on many imaging and vision tasks has been driven by the use of\ndeep feed-forward neural networks, which are trained by propagating gradients of\na loss de\ufb01ned on the \ufb01nal output, back through the network up to the \ufb01rst layer that\noperates directly on the image. We propose back-propagating one step further\u2014to\nlearn camera sensor designs jointly with networks that carry out inference on the\nimages they capture. In this paper, we speci\ufb01cally consider the design and inference\nproblems in a typical color camera\u2014where the sensor is able to measure only one\ncolor channel at each pixel location, and computational inference is required to\nreconstruct a full color image. We learn the camera sensor\u2019s color multiplexing\npattern by encoding it as a layer whose learnable weights determine which color\nchannel, from among a \ufb01xed set, will be measured at each location. These weights\nare jointly trained with those of a reconstruction network that operates on the\ncorresponding sensor measurements to produce a full color image. Our network\nachieves signi\ufb01cant improvements in accuracy over the traditional Bayer pattern\nused in most color cameras. 
It automatically learns to employ a sparse color\nmeasurement approach similar to that of a recent design, and moreover, improves\nupon that design by learning an optimal layout for these measurements.\n\n1\n\nIntroduction\n\nWith the availability of cheap computing power, modern cameras can rely on computational post-\nprocessing to extend their capabilities under the physical constraints of existing sensor technology.\nSophisticated techniques, such as those for denoising [3, 28], deblurring [19, 26], etc., are increasingly\nbeing used to improve the quality of images and videos that were degraded during acquisition.\nMoreover, researchers have posited novel sensing strategies that, when combined with post-processing\nalgorithms, are able to produce higher quality and more informative images and videos. For example,\ncoded exposure imaging [18] allows better inversion of motion blur, coded apertures [14, 23] allow\npassive measurement of scene depth from a single shot, and compressive measurement strategies [1,\n8, 25] combined with sparse reconstruction algorithms allow the recovery of visual measurements\nwith higher spatial, spectral, and temporal resolutions.\n\nKey to the success of these latter approaches is the co-design of sensing strategies and inference\nalgorithms, where the measurements are designed to provide information complementary to the\nknown statistical structure of natural scenes. So far, sensor design in this regime has largely been\neither informed by expert intuition (e.g., [4]), or based on the decision to use a speci\ufb01c image model\nor inference strategy\u2014e.g., measurements corresponding to random [1], or dictionary-speci\ufb01c [5],\nprojections are a common choice for sparsity-based reconstruction methods. 
In this paper, we seek to\nenable a broader data-driven exploration of the joint sensor and inference method space, by learning\nboth sensor design and the computational inference engine end-to-end.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fFigure 1: We propose a method to learn the optimal color multiplexing pattern for a camera through\njoint training with a neural network for reconstruction. (Top) Given C possible color \ufb01lters that could\nbe placed at each pixel, we parameterize the incident light as a C-channel image. This acts as input\nto a \u201csensor layer\u201d that learns to select one of these channels at each pixel. A reconstruction network\nthen processes these measurements to yield a full-color RGB image. We jointly train both for optimal\nreconstruction quality. (Bottom left) Since the hard selection of individual color channels is not\ndifferentiable, we encode these decisions using a Soft-max layer, with a \u201ctemperature\u201d parameter \u03b1\nthat is increased across iterations. (Bottom right) We use a bifurcated architecture with two paths\nfor the reconstruction network. One path produces K possible values for each color intensity through\nmultiplicative and linear interpolation, and the other produces weights to combine these into a single estimate.\n\nWe leverage the successful use of back-propagation and stochastic gradient descent (SGD) [13] in\nlearning deep neural networks for various tasks [12, 16, 20, 24]. These networks process a given\ninput through a complex cascade of layers, and training is able to jointly optimize the parameters of\nall layers to enable the network to succeed at the \ufb01nal inference task. 
Treating optical measurement\nand computational inference as a cascade, we propose using the same approach to learn both jointly.\nWe encode the sensor\u2019s design choices into the learnable parameters of a \u201csensor layer\u201d which, once\ntrained, can be instantiated by camera optics. This layer\u2019s output is fed to a neural network that carries\nout inference computationally on the corresponding measurements. Both are then trained jointly.\n\nWe demonstrate this approach by applying it to the sensor-inference design problem in a standard\ndigital color camera. Since image sensors can physically measure only one color channel at each\npixel, cameras spatially multiplex the measurement of different colors across the sensor plane, and\nthen computationally recover the missing intensities through a reconstruction process known as\ndemosaicking. We jointly learn the spatial pattern for multiplexing different color channels\u2014that\nrequires making a hard decision to use one of a discrete set of color \ufb01lters at each pixel\u2014along with\na neural network that performs demosaicking. Together, these enable the recovery of high-quality\ncolor images of natural scenes. We \ufb01nd that our approach signi\ufb01cantly outperforms the traditional\nBayer pattern [2] used in most color cameras. We also compare it to a recently introduced design [4]\nbased on making sparse color measurements, that has superior noise performance and fewer aliasing\nartifacts. Interestingly, our network automatically learns to employ a similar measurement strategy,\nbut is able to outperform this design by \ufb01nding a more optimal spatial layout for the color measurements.\n\n2 Background\n\nSince both CMOS and CCD sensors can measure only the total intensity of visible light incident on\nthem, color is typically measured by placing an array of color \ufb01lters (CFA) in front of the sensor\nplane. 
The CFA pattern determines which color channel is measured at which pixel, with the most\ncommon pattern used in RGB color cameras being the Bayer mosaic [2] introduced in 1976. This\nis a 2 \u00d7 2 repeating pattern, with two measurements of the green channel and one each of red and\nblue. The color values that are not directly measured are then reconstructed computationally by\ndemosaicking algorithms. These algorithms [15] typically rely on the assumption that different\ncolor channels are correlated and piecewise smooth, and reason about locations of edges and other\nhigh-frequency image content to avoid creating aliasing artifacts.\n\nThis approach yields reasonable results, and the Bayer pattern remains in widespread use even today.\nHowever, the choice of the CFA pattern involves a trade-off. Color \ufb01lters placed in front of the sensor\nblock part of the incident light energy, leading to longer exposure times or noisier measurements (in\ncomparison to grayscale cameras). Moreover, since every channel is regularly sub-sampled in the\nBayer pattern, reconstructions are prone to visually disturbing aliasing artifacts even with the best\nreconstruction methods. Most consumer cameras address this by placing an anti-aliasing \ufb01lter in\nfront of the sensor to blur the incident light \ufb01eld, but this leads to a loss of sharpness and resolution.\n\nTo address this, Chakrabarti et al. [4] recently proposed the use of an alternative CFA pattern in which\na majority of the pixels measure the total un\ufb01ltered visible light intensity. Color is measured only\nsparsely, using 2 \u00d7 2 Bayer blocks placed at regularly spaced intervals on the otherwise un\ufb01ltered\nsensor plane. The resulting measured image corresponds to an un-aliased full resolution luminance\nimage (i.e., the un\ufb01ltered measurements) with \u201choles\u201d at the color sampling sites; with point-wise\ncolor information on a coarser grid. 
The reconstruction algorithm in [4] is signi\ufb01cantly different\nfrom traditional demosaicking, and involves \ufb01rst recovering missing luminance values by hole-\ufb01lling\n(which is computationally easier than up-sampling since there is more context around the missing\nintensities), and then propagating chromaticities from the color measurement sites to the remaining\npixels using edges in the luminance image as a guide. This approach was shown to signi\ufb01cantly\nimprove upon the capabilities of a Bayer sensor\u2014in terms of better noise performance, increased\nsharpness, and reduced aliasing artifacts.\n\nThat [4]\u2019s CFA pattern required a very different reconstruction algorithm illustrates the fact that\nboth the sensor and inference method need to be modi\ufb01ed together to achieve gains in performance.\nIn [4]\u2019s case, this was achieved by applying an intuitive design principle\u2014of making high SNR\nnon-aliased measurements of one color channel. However, these principles are tied to a speci\ufb01c\nreconstruction approach, and do not tell us, for example, whether regularly spaced 2 \u00d7 2 blocks are\nthe optimal way of measuring color sparsely.\n\nWhile learning-based methods have been proposed for demosaicking [10, 17, 22] (as well as for joint\ndemosaicking and denoising [9, 11]), these work with a pre-determined CFA pattern and training is\nused only to tune the reconstruction algorithm. In contrast, our approach seeks to learn, automatically\nfrom data, both the CFA pattern and reconstruction method, so that they are jointly optimal in terms\nof reconstruction quality.\n\n3 Jointly Learning Measurement and Reconstruction\n\nWe formulate our task as that of reconstructing an RGB image y(n) \u2208 R^3, where n \u2208 Z^2 indexes\npixel location, from a measured sensor image s(n) \u2208 R. 
Along with this reconstruction task,\nwe also have to choose a multiplexing pattern which determines the color channel that each s(n)\ncorresponds to. We let this choice be between one of C channels\u2014a parameterization that takes\ninto account which spectral \ufb01lters can be physically synthesized. We use x(n) \u2208 R^C to denote the\nintensity measurements corresponding to each of these color channels, and a zero-one selection map\nI(n) \u2208 {0, 1}^C, |I(n)| = 1 to encode the multiplexing pattern, such that the corresponding sensor\nmeasurements are given by s(n) = I(n)^T x(n). Moreover, we assume that I(n) repeats periodically\nevery P pixels, and therefore only has P^2 unique values.\n\nGiven a training set consisting of pairs of output images y(n) and C-channel input images x(n),\nour goal then is to learn this pattern I(n), jointly with a reconstruction algorithm that maps the\ncorresponding measurements s(n) to the full color image output y(n). We use a neural network\nto map sensor measurements s(n) to an estimate \u02c6y(n) of the full color image. Furthermore, we\nencode the measurement process into a \u201csensor layer\u201d, which maps the input x(n) to measurements\ns(n), and whose learnable parameters encode the multiplexing pattern I(n). We then learn both\nthe reconstruction network and the sensor layer simultaneously, with respect to a squared loss\n||\u02c6y(n) \u2212 y(n)||^2 between the reconstructed and true color images.\n\n3.1 Learning the Multiplexing Pattern\n\nThe key challenge to our joint learning problem lies in recovering the optimal multiplexing pattern\nI(n), since it is ordinal-valued and requires learning to make a hard non-differentiable decision\nbetween C possibilities. 
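As a concrete illustration of the measurement model defined above, tiling the P x P selection map I(n) periodically over the image plane and taking s(n) = I(n)^T x(n) might be sketched as follows (a minimal NumPy sketch; the function and variable names are our own, not the paper's):

```python
import numpy as np

def sense(x, I):
    """Simulate the sensor layer: s(n) = I(n)^T x(n), with the P x P
    selection map I(n) tiled periodically over the image plane.

    x : (H, W, C) array of per-channel incident intensities.
    I : (P, P, C) selection map; one-hot per pixel for a physical sensor.
    Returns an (H, W) array of scalar sensor measurements.
    """
    H, W, _ = x.shape
    P = I.shape[0]
    reps = (-(-H // P), -(-W // P), 1)      # ceil-divide tiling counts
    I_full = np.tile(I, reps)[:H, :W, :]    # periodic repetition, cropped
    return (I_full * x).sum(axis=-1)
```

During training I(n) is soft (a distribution over channels), so the same routine also serves as the differentiable forward pass of the sensor layer.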
To address this, we rely on the standard soft-max operation, which is\ntraditionally used in multi-label classi\ufb01cation tasks.\n\nHowever, we are unable to use the soft-max operation directly\u2014unlike in classi\ufb01cation tasks where\nthe ordinal labels are the \ufb01nal output, and where the training objective prefers hard assignment\nto a single label, in our formulation I(n) is used to generate sensor measurements that are then\nprocessed by a reconstruction network. Indeed, when using a straight soft-max, we \ufb01nd that the\nreconstruction network converges to real-valued I(n) maps that correspond to measuring different\nweighted combinations of the input channels. Thresholding the learned I(n) to be ordinal valued\nleads to a signi\ufb01cant drop in performance, even when we further train the reconstruction network to\nwork with this thresholded version.\n\nOur solution to this is fairly simple. We use a soft-max with a temperature parameter that is increased\nslowly through training iterations. Speci\ufb01cally, we learn a vector w(n) \u2208 R^C for each location n of\nthe multiplexing pattern, with the corresponding I(n) given during training as:\n\nI(n) = Soft-max [\u03b1t w(n)] ,\n\n(1)\n\nwhere \u03b1t is a scalar factor that we increase with iteration number t.\n\nTherefore, in addition to changes due to the SGD updates to w(n), the effective distribution of I(n)\nbecomes \u201cpeakier\u201d at every iteration because of the increasing \u03b1t, and as \u03b1t \u2192 \u221e, I(n) becomes a\nzero-one vector. Note that the gradient magnitudes of w(n) also scale-up, since we compute these\ngradients at each iteration with respect to the current value of t. This ensures that the pattern can keep\nlearning in the presence of a strong supervisory signal from the loss, while retaining a bias to drift\ntowards making a hard choice for a single color channel.\n\nAs illustrated in Fig. 
1, our sensor layer contains a parameter vector w(n) for each pixel of the\nP \u00d7 P multiplexing pattern. During training, we generate the corresponding I(n) vectors using\n(1) above, and the layer then outputs sensor measurements based on the C-channel input x(n) as\ns(n) = I(n)^T x(n). Once training is complete (and for validation during training), we replace I(n)\nwith its zero-one version as I(n)_c = 1 for c = arg max_c w_c(n), and 0 otherwise.\n\nAs we report in Sec. 4, our approach is able to successfully learn an optimal sensing pattern, which\nadapts during training to match the evolving reconstruction network. We would also like to note here\ntwo alternative strategies that we explored to learn an ordinal I(n), which were not as successful. We\nconsidered using a standard soft-max approach with a separate entropy penalty on the distribution\nI(n)\u2014however, this caused the pattern I(n) to stop learning very early during training (or for lower\nweighting of the penalty, had no effect at all). We also tried to incrementally pin the lowest I(n)\nvalues to zero after training for a number of iterations, in a manner similar to Han et al.\u2019s [7] approach\nto network compression. However, even with signi\ufb01cant tuning, this approach caused large parts of\nthe pattern search space to be eliminated early, and was not able to adapt to the fact that a channel\nwith a low weight at a particular location might eventually become desirable based on changes to the\npattern at other locations, and corresponding updates to the reconstruction network.\n\n3.2 Reconstruction Network Architecture\n\nTraditional demosaicking algorithms [15] produce a full color image by interpolating the missing\ncolor values from neighboring measurement sites, and by exploiting cross-channel dependencies.\nThis interpolation is often linear, but in some cases takes the form of transferring chromaticities or\ncolor ratios (e.g., in [4]). 
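The selection logic of Sec. 3.1, the temperature-annealed soft-max of Eq. (1) at training time and the zero-one hardening afterwards, can be sketched as follows (a minimal NumPy sketch; the function names are ours):

```python
import numpy as np

def soft_selection(w, alpha):
    """Training-time selection map of Eq. (1): I(n) = Soft-max[alpha_t w(n)].

    w     : (P, P, C) learnable logits, one C-vector per pattern location.
    alpha : scalar temperature factor alpha_t, increased across iterations.
    """
    z = alpha * w
    z = z - z.max(axis=-1, keepdims=True)        # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hard_selection(w):
    """Zero-one map used after training: I(n)_c = 1 for c = argmax_c w_c(n)."""
    return np.eye(w.shape[-1])[w.argmax(axis=-1)]
```

As alpha_t grows, soft_selection approaches hard_selection, which is what allows the trained pattern to be instantiated as a physical color filter array.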
Moreover, most demosaicking algorithms reason about image textures and\nedges to avoid smoothing across boundaries or creating aliasing artifacts.\n\nWe adopt a simple bifurcated network architecture that leverages these intuitions. As illustrated in\nFig. 1, our network reconstructs each P \u00d7 P patch in y(n) from a receptive \ufb01eld that is centered on\nthat patch in the measured image s(n), and thrice as large in each dimension. The network has two\npaths, both of which operate on the entire input and both output (P \u00d7 P \u00d7 3K) values, i.e., K values for\neach output color intensity. We denote these outputs as \u03bb(n, k), f(n, k) \u2208 R^3.\n\nOne path produces f(n, k) by \ufb01rst computing multiplicative combinations of the entire 3P \u00d7 3P\ninput patch\u2014we instantiate this using a fully-connected layer without a bias term that operates in\nthe log-domain\u2014followed by linear combinations across each of the 3K values at each location.\nWe interpret these f(n, k) values as K proposals for each y(n). The second path uses a more\nstandard cascade of convolution layers\u2014all of which have F outputs with the \ufb01rst layer having a\nstride of P\u2014followed by a fully connected layer that produces the outputs \u03bb(n, k) with the same\ndimensionality as f(n, k). We treat \u03bb(n, k) as gating values for the proposals f(n, k), and generate\nthe \ufb01nal reconstructed patch \u02c6y(n) as \u03a3_k \u03bb(n, k) f(n, k).\n\n4 Experiments\n\nWe follow a similar approach to [4] for training and evaluating our method. Like [4], we use\nthe Gehler-Shi database [6, 21] that consists of 568 color images of indoor and outdoor scenes,\ncaptured under various illuminants. These images were obtained from RAW sensor images from a\ncamera employing the Bayer pattern with an anti-aliasing optical \ufb01lter, by using the different color\nmeasurements in each Bayer block to construct a single RGB pixel. 
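The final step of the bifurcated architecture, combining the proposals f(n, k) and gates lambda(n, k) into the reconstructed patch, is just a weighted sum per output intensity. A sketch (the array shapes are our own choice of layout, not specified by the paper):

```python
import numpy as np

def combine(f, lam):
    """Combine the two network paths' outputs into the reconstructed patch:
    yhat(n) = sum_k lam(n, k) * f(n, k).

    f   : (P, P, 3, K) color-value proposals from the multiplicative path.
    lam : (P, P, 3, K) matching gating values from the convolutional path.
    Returns the (P, P, 3) reconstructed RGB patch.
    """
    return (lam * f).sum(axis=-1)
```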
These images are therefore at half\nthe resolution of the original sensor image, but have statistics that are representative of aliasing-free\nfull color images of typical natural scenes. Unlike [4] who only used 10 images for evaluation, we\nuse the entire dataset\u2014using 56 images for testing, 461 images for training, and the remaining 51\nimages as a validation set to \ufb01x hyper-parameters.\n\nWe treat the images in the dataset as the ground truth for the output RGB images y(n). As sensor\nmeasurements, we consider C = 4 possible color channels. The \ufb01rst three correspond to the original\nsensor RGB channels. Like [4], we choose the fourth channel to be white or panchromatic, and\nconstruct it as the sum of the RGB measurements. As mentioned in [4], this corresponds to a\nconservative estimate of the light-ef\ufb01ciency of an un\ufb01ltered channel. We construct the C-channel\ninput image x(n) by including these measurements, followed by addition of different levels of\nGaussian noise, with high noise variances simulating low-light capture.\n\nWe learn a repeating pattern with P = 8. In our reconstruction network, we set the number of\nproposals K for each output intensity to 24, and the number of convolutional layer outputs F in\nthe second path of our network to 128. When learning our sensor multiplexing pattern, we increase\nthe scalar soft-max factor \u03b1t in (1) according to a quadratic schedule as \u03b1t = 1 + (\u03b3t)^2, where\n\u03b3 = 2.5 \u00d7 10^-5 in our experiments. We train a separate reconstruction network for each noise level\n(positing that a camera could select between these based on the ISO settings). 
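The construction of the C = 4 channel input described above (RGB plus a panchromatic channel built as the sum of RGB, with additive Gaussian noise), together with the quadratic temperature schedule, might look like this sketch (helper names are ours):

```python
import numpy as np

def make_input(rgb, noise_std, rng):
    """Build the 4-channel input x(n): the three sensor RGB channels plus
    a panchromatic (white) channel constructed as the sum of RGB, followed
    by additive Gaussian noise simulating the capture process.

    rgb : (H, W, 3) ground-truth image, intensities scaled to [0, 1].
    """
    pan = rgb.sum(axis=-1, keepdims=True)
    x = np.concatenate([rgb, pan], axis=-1)      # (H, W, 4)
    return x + rng.normal(0.0, noise_std, size=x.shape)

def alpha_schedule(t, gamma=2.5e-5):
    """Quadratic soft-max temperature schedule: alpha_t = 1 + (gamma * t)^2.
    With gamma = 2.5e-5 (the paper's setting), alpha_t stays near 1 early
    in training and grows to roughly 1407 by iteration 1.5 million."""
    return 1.0 + (gamma * t) ** 2
```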
However, since it is\nimpractical to employ different sensors for different settings, we learn a single spatial multiplexing\npattern, optimized for reconstruction under moderate noise levels with standard deviation (STD) of\n0.01 (with respect to intensity values in x(n) scaled to be between 0 and 1).\n\nWe train our sensor layer and reconstruction network jointly at this noise level on sets of 8 \u00d7 8\ny(n) patches and corresponding 24 \u00d7 24 x(n) patches sampled randomly from the training set. We\nuse a batch-size of 128, with a learning rate of 0.001 for 1.5 million iterations. Then, keeping the\nsensor pattern \ufb01xed to our learned version, we train reconstruction networks from scratch for other\nnoise levels\u2014training again with a learning rate of 0.001 for 1.5 million iterations, followed by another\n100,000 iterations with a rate of 10^-4. We also train reconstruction networks at all noise levels\nin a similar way for the Bayer pattern, as well as the pattern of [4] (with a color sampling rate of 4).\nMoreover, to allow consistent comparisons, we re-train the reconstruction network for our pattern at\nthe 0.01 noise level from scratch following this regime.\n\n4.1 Evaluating the Reconstruction Network\n\nWe begin by comparing the performance of our learned reconstruction networks to traditional\ndemosaicking algorithms for the standard Bayer pattern, and the pattern of [4]. Note that our goal\nis not to propose a new demosaicking method for existing sensors. 
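Reconstruction quality in the evaluations that follow is reported as PSNR over image patches. For reference, a standard PSNR computation for intensities scaled to [0, 1] (this helper is ours, not the paper's evaluation code) is:

```python
import numpy as np

def psnr(est, ref, peak=1.0):
    """Peak signal-to-noise ratio, in dB, between an estimated image and a
    reference image whose intensities lie in [0, peak]."""
    mse = np.mean((est - ref) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```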
Nevertheless, since our sensor\npattern is being learned jointly with our proposed reconstruction architecture, it is important to\ndetermine whether this architecture can learn to reason effectively with different kinds of sensor\npatterns, which is necessary to effectively cover the joint sensor-inference design space.\n\nWe compare our learned networks to Zhang and Wu\u2019s method [27] for the Bayer pattern, and\nChakrabarti et al.\u2019s method [4] for their own pattern. We measure performance in terms of the\nreconstruction PSNR of all non-overlapping 64 \u00d7 64 patches from all test images (roughly 40,000\npatches). Table 1 compares the median PSNR values across all patches for reconstructions using our\nnetwork to those from traditional methods, at two noise levels\u2014low noise corresponding to an STD\nof 0.0025, and moderate noise corresponding to 0.01. For the pattern of [4], we \ufb01nd that our network\nperforms similarly to their reconstruction method at the low noise level, and signi\ufb01cantly better at the\nhigher noise level. On the Bayer pattern, our network achieves much better performance at both noise\nlevels. We also note here that reconstruction using our network is signi\ufb01cantly faster\u2014taking 9s on a\nsix core CPU, and 200ms when using a Titan X GPU, for a 2.7 mega-pixel image. In comparison, [4]\nand [27]\u2019s reconstruction methods take 20s and 1 min. respectively on the CPU.\n\n[Figure 2 panels: learned patterns at increasing training iterations, labeled It # 2,500 through It # 1,500,000 (Final), with per-pattern entropies: 1.38 at iterations 2,500 to 12,500; 1.37 at 25,000; 1.02 at 100,000; 0.78 at 200,000; 0.75 at 300,000; 0.82 at 400,000; 0.86 at 500,000; 0.85 at 600,000; 0.57 at 1,000,000; 0.37 at 1,100,000; 0.35 at 1,200,000; 0.25 at 1,300,000; 0.18 at 1,400,000.]\n\nFigure 2: Evolution of sensor pattern through training iterations. We \ufb01nd that our network\u2019s color\nsensing pattern changes qualitatively through the training process. In initial iterations, the sensor\nlayer learns to sample color channels directly. As training continues, these color measurements are\nreplaced by panchromatic (white) pixels. The \ufb01nal iterations see \ufb01ne re\ufb01nements to the pattern. We\nalso report the mean (across pixels) entropy of the underlying distribution I(n) for each pattern. Note\nthat, as expected, this entropy decreases across iterations as the distributions I(n) evolve from being\nsoft selections of color channels, to zero-one vectors that make hard ordinal decisions.\n\nTable 1: Median Reconstruction PSNR (dB) using Traditional demosaicking and Proposed Network\n\nMethod | Bayer (STD=0.0025) | Bayer (STD=0.01) | CFZ [4] (STD=0.0025) | CFZ [4] (STD=0.01)\nTraditional | 42.69 | 32.44 | 48.84 | 39.55\nNetwork | 47.55 | 43.72 | 49.08 | 44.64\n\n4.2 Visualizing Sensor Pattern Training\n\nIn Fig. 
2, we visualize the evolution of our sensor pattern during the training process, while it is being\njointly learned with the reconstruction network. In the initial iterations, the sensor layer displays a\npreference for densely sampling the RGB channels, with very few panchromatic measurements\u2014in\nfact, in the \ufb01rst row of Fig. 2, we see panchromatic pixels switching to color measurements. This\nis likely because early on in the training process, the reconstruction network hasn\u2019t yet learned to\nexploit cross-channel correlations, and therefore needs to measure the output channels directly.\n\nFigure 3: Example reconstructions from (noisy) measurements with different sensor multiplexing\npatterns. Best viewed at higher resolution in the electronic version.\n\nHowever, as training progresses, the reconstruction network gets more sophisticated, and we see the\nnumber of color measurements get sparser and sparser, in favor of panchromatic pixels that offer the\nadvantage of higher SNR. Essentially, the sensor layer begins to adopt one of the design principles of\n[4]. However, it distributes the color measurement sites across the pattern, instead of concentrating\nthem into separated blocks like [4]. In the last 500K iterations, we see that most changes correspond\nto \ufb01ne re\ufb01nements of the pattern, with a few individual pixels swapping the channels they measure.\n\nWhile the patterns themselves in Fig. 2 correspond to the channel at each pixel with the maximum\nvalue in the selection map I(n), remember that these maps themselves are soft. Therefore, we also\nreport the mean entropy of the underlying I(n) for each pattern in Fig. 2. 
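The entropy reported under each pattern in Fig. 2 is the mean, over pattern pixels, of the entropy of each distribution I(n); the near-uniform early value of 1.38 matches ln 4 (about 1.386) for C = 4 channels, which suggests the entropy is measured in nats. A sketch (function name ours):

```python
import numpy as np

def mean_entropy(I):
    """Mean (across pattern pixels) entropy, in nats, of the soft selection
    maps I(n), as reported under each pattern in Fig. 2.

    I : (P, P, C) array of per-pixel distributions over the C channels.
    """
    p = np.clip(I, 1e-12, 1.0)              # guard against log(0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())
```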
We see that this entropy\ndecreases across iterations, as the choice of color channel for more and more pixels becomes \ufb01xed,\nwith their distributions in I(n) becoming peakier and closer to being zero-one vectors.\n\n4.3 Evaluating Learned Pattern\n\nFinally, we compare the performance of neural network-based reconstruction from measurements\nwith our learned pattern to those with the Bayer pattern and the pattern of [4]. Table 2 shows\ndifferent quantiles of reconstruction PSNR for various noise levels, with noise STDs ranging from\n0 to 0.04. Even though our sensor pattern was trained at the noise level of STD=0.01, we \ufb01nd it\nachieves the highest reconstruction quality over a large range of noise levels. Speci\ufb01cally, it always\noutperforms the Bayer pattern, by fairly signi\ufb01cant margins at higher noise levels. The improvement\nin performance over [4]\u2019s pattern is less pronounced, although we do achieve consistently higher\nPSNR values for all quantiles at most noise levels. Figure 3 shows examples of color patches\nreconstructed from our learned sensor, and compares these to those from the Bayer pattern and [4].\n\nWe see that the reconstructions from the Bayer pattern are noticeably worse. This is because it\nmakes lower SNR measurements, and the reconstruction networks learn to smooth their outputs to\nreduce this noise. Both [4] and our pattern yield signi\ufb01cantly better reconstructions. Indeed, most of\nour gains over the Bayer pattern come from choosing to make most measurements panchromatic, a\ndesign principle shared by [4]. However, remember that our sensor layer learns this principle entirely\nautomatically from data, without expert supervision. 
Moreover, we see that [4]\u2019s reconstructions\ntend to have a few more instances of \u201cchromaticity noise\u201d, in the form of contiguous regions with\nincorrect hues, which explains its slightly lower PSNR values in Table 2.\n\nTable 2: Network Reconstruction PSNR (dB) Quantiles for various CFA Patterns\n\nNoise STD | Percentile | Bayer [2] | CFZ [4] | Learned\n0 | 25% | 47.62 | 48.04 | 47.97\n0 | 50% | 51.72 | 52.17 | 52.12\n0 | 75% | 54.97 | 55.32 | 55.30\n0.0025 | 25% | 44.61 | 46.05 | 46.08\n0.0025 | 50% | 47.55 | 49.08 | 49.17\n0.0025 | 75% | 50.52 | 51.57 | 51.76\n0.0050 | 25% | 42.55 | 44.33 | 44.37\n0.0050 | 50% | 45.63 | 47.01 | 47.19\n0.0050 | 75% | 48.73 | 49.68 | 49.94\n0.0075 | 25% | 41.34 | 42.92 | 43.08\n0.0075 | 50% | 44.48 | 45.60 | 45.85\n0.0075 | 75% | 47.77 | 48.41 | 48.69\n0.0100 | 25% | 40.58 | 41.97 | 42.16\n0.0100 | 50% | 43.72 | 44.64 | 44.94\n0.0100 | 75% | 47.10 | 47.56 | 47.80\n0.0125 | 25% | 40.29 | 41.17 | 41.41\n0.0125 | 50% | 43.36 | 43.88 | 44.22\n0.0125 | 75% | 46.65 | 47.04 | 47.27\n0.0150 | 25% | 39.97 | 40.54 | 40.85\n0.0150 | 50% | 43.03 | 43.29 | 43.69\n0.0150 | 75% | 46.25 | 46.69 | 46.86\n0.0175 | 25% | 39.60 | 40.03 | 40.31\n0.0175 | 50% | 42.62 | 42.83 | 43.12\n0.0175 | 75% | 45.82 | 46.39 | 46.45\n0.0200 | 25% | 39.31 | 39.49 | 39.96\n0.0200 | 50% | 42.39 | 42.39 | 42.78\n0.0200 | 75% | 45.56 | 46.14 | 46.23\n0.0300 | 25% | 38.18 | 38.31 | 38.92\n0.0300 | 50% | 41.17 | 41.48 | 41.85\n0.0300 | 75% | 44.23 | 45.61 | 45.63\n0.0400 | 25% | 37.14 | 37.43 | 38.00\n0.0400 | 50% | 39.98 | 40.86 | 41.02\n0.0400 | 75% | 43.17 | 45.11 | 44.98\n\n5 Conclusion\n\nIn this paper, we proposed learning sensor design jointly with a neural network that carried out\ninference on the sensor\u2019s measurements, speci\ufb01cally focusing on the problem of \ufb01nding the optimal\ncolor multiplexing pattern for a digital color camera. We learned this pattern by joint training with\na neural network for reconstructing full color images from the multiplexed measurements. We\nused a soft-max operation with an increasing temperature parameter to model the non-differentiable\ncolor channel selection at each point, which allowed us to train the pattern effectively. 
Finally,\nwe demonstrated that our learned pattern enabled better reconstructions than past designs. An\nimplementation of our method, along with trained models, data, and results, is available at our project\npage at http://www.ttic.edu/chakrabarti/learncfa/.\n\nOur results suggest that learning measurement strategies jointly with computational inference is both\nuseful and possible. In particular, our approach can be used directly to learn other forms of optimized\nmultiplexing patterns\u2014e.g., spatio-temporal multiplexing for video, viewpoint multiplexing in light-\n\ufb01eld cameras, etc. Moreover, these patterns can be learned to be optimal for inference tasks beyond\nreconstruction. For example, a sensor layer jointly trained with a neural network for classi\ufb01cation\ncould be used to discover optimal measurement strategies for say, distinguishing between biological\nsamples using multi-spectral imaging, or detecting targets in remote sensing.\n\nAcknowledgments\n\nWe thank NVIDIA corporation for the donation of a Titan X GPU used in this research.\n\n8\n\n\fReferences\n\n[1] R. G. Baraniuk. Compressive sensing. IEEE Signal Processing Magazine, 2007.\n[2] B. E. Bayer. Color imaging array. US Patent 3971065, 1976.\n[3] H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks\n\ncompete with BM3D? In Proc. CVPR, 2012.\n\n[4] A. Chakrabarti, W. T. Freeman, and T. Zickler. Rethinking color cameras. In Proc. ICCP, 2014.\n[5] M. Elad. Optimized projections for compressed sensing. IEEE Trans. Sig. Proc., 2007.\n[6] P. V. Gehler, C. Rother, A. Blake, T. Minka, and T. Sharp. Bayesian color constancy revisited.\n\nIn Proc. CVPR, 2008.\n\n[7] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with\n\npruning, trained quantization and huffman coding. arXiv:1510.00149, 2015.\n\n[8] J. Holloway, A. C. Sankaranarayanan, A. Veeraraghavan, and S. Tambe. 
Flutter shutter video camera for compressive sensing of videos. In Proc. ICCP, 2012.
[9] T. Klatzer, K. Hammernik, P. Knobelreiter, and T. Pock. Learning joint demosaicing and denoising based on sequential energy minimization. In Proc. ICCP, 2016.
[10] O. Kapah and H. Z. Hel-Or. Demosaicking using artificial neural networks. In Electronic Imaging, 2000.
[11] D. Khashabi, S. Nowozin, J. Jancsary, and A. W. Fitzgibbon. Joint demosaicing and denoising via learned nonparametric random fields. IEEE Trans. Imag. Proc., 2014.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[13] Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. In Neural Networks: Tricks of the Trade. Springer, 1998.
[14] A. Levin, R. Fergus, F. Durand, and W. T. Freeman. Image and depth from a conventional camera with a coded aperture. ACM Transactions on Graphics (TOG), 2007.
[15] X. Li, B. Gunturk, and L. Zhang. Image demosaicing: A systematic survey. In Proc. SPIE, 2008.
[16] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. CVPR, 2015.
[17] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In Proc. ICCV, 2009.
[18] R. Raskar, A. Agrawal, and J. Tumblin. Coded exposure photography: Motion deblurring using fluttered shutter. ACM Transactions on Graphics (TOG), 2006.
[19] C. J. Schuler, H. C. Burger, S. Harmeling, and B. Schölkopf. A machine learning approach for non-blind image deconvolution. In Proc. CVPR, 2013.
[20] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229, 2013.
[21] L. Shi and B. Funt. Re-processed version of the Gehler color constancy dataset of 568 images. 2010.
Accessed from http://www.cs.sfu.ca/~colour/data/.
[22] J. Sun and M. F. Tappen. Separable Markov random field model and its applications in low level vision. IEEE Trans. Imag. Proc., 2013.
[23] A. Veeraraghavan, R. Raskar, A. Agrawal, A. Mohan, and J. Tumblin. Dappled photography: Mask enhanced cameras for heterodyned light fields and coded aperture refocusing. ACM Transactions on Graphics (TOG), 2007.
[24] X. Wang, D. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. In Proc. CVPR, 2015.
[25] A. E. Waters, A. C. Sankaranarayanan, and R. Baraniuk. SpaRCS: Recovering low-rank and sparse matrices from compressive measurements. In NIPS, 2011.
[26] L. Xu, J. S. Ren, C. Liu, and J. Jia. Deep convolutional neural network for image deconvolution. In NIPS, 2014.
[27] L. Zhang and X. Wu. Color demosaicking via directional linear minimum mean square-error estimation. IEEE Trans. Imag. Proc., 2005.
[28] D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. In Proc. ICCV, 2011.