{"title": "A Step Toward Quantifying Independently Reproducible Machine Learning Research", "book": "Advances in Neural Information Processing Systems", "page_first": 5485, "page_last": 5495, "abstract": "What makes a paper independently reproducible? Debates on reproducibility center around intuition or assumptions but lack empirical results. Our field focuses on releasing code, which is important, but is not sufficient for determining reproducibility. We take the first step toward a quantifiable answer by manually attempting to implement 255 papers published from 1984 until 2017, recording features of each paper, and performing statistical analysis of the results. For each paper, we did not look at the authors code, if released, in order to prevent bias toward discrepancies between code and paper.", "full_text": "A Step Toward Quantifying Independently\nReproducible Machine Learning Research\n\nEdward Raff\n\nBooz Allen Hamilton\n\nraff_edward@bah.com\n\nUniversity of Maryland, Baltimore County\n\nraff.edward@umbc.edu\n\nAbstract\n\nWhat makes a paper independently reproducible? Debates on reproducibility cen-\nter around intuition or assumptions but lack empirical results. Our \ufb01eld focuses\non releasing code, which is important, but is not suf\ufb01cient for determining repro-\nducibility. We take the \ufb01rst step toward a quanti\ufb01able answer by manually attempt-\ning to implement 255 papers published from 1984 until 2017, recording features\nof each paper, and performing statistical analysis of the results. For each paper,\nwe did not look at the authors code, if released, in order to prevent bias toward\ndiscrepancies between code and paper.\n\n1\n\nIntroduction\n\nAs the \ufb01elds of Arti\ufb01cial Intelligence (AI) and Machine Learning (ML) have grown in recent years,\nso too have calls that we are currently in an AI/ML reproducibility crisis [1]. 
Conferences, such as\nNeurIPS, have added reproducibility as a factor in the reviewing cycle or implemented policies to\nencourage code sharing. Many are pursuing work centered around code and data availability as one\nof the more direct methods of enhancing reproducibility. For example, Dror et al. [2] developed a\nproposal to standardize the description and release of datasets. Others have proposed taxonomies\nand ontologies over reproducibility based on the availability of algorithm description, code, and\ndata [3, 4]. Others have focused on building frameworks for sharing code and automation of hyper-parameter\nselection in order to enable easier reconstruction of results [5].\nWhile the ability to replicate the results of papers through open-sourced code and data is valuable\nand should be lauded, it has been argued that releasing code is insufficient [6]. The inability to\nreproduce results without code availability may suggest problems with the paper. This may be due to\nthe following: insufficient explanation of the approach, failure to describe important minute details,\nor a discrepancy between the code and description. We will call the act of reproducing the results of a\npaper without use of code from the paper\u2019s authors, independent reproducibility. We argue that for a\npaper to be scientifically sound and complete, it should be independently reproducible.\nThe question we wish to answer in this work is: what makes a paper independently reproducible?\nMany have argued fiercely for different aspects of writing and publishing as critical factors of\nreproducibility. Quantifiable study of these efforts is needed to advance the conversation. Otherwise,\nwe as a community will have no scientific evidence that our work is addressing aspects of\nreproducibility. Gundersen and Kjensmo [7] defined several paper-properties of interest in regard to\nreproducibility. 
However, they defined a paper as reproducible purely as a function of the features,\nwithout knowing if the selected features (e.g., method is described, data is available) actually impact\na paper\u2019s reproducibility.\nAs a first step toward answering this question, we performed a study of 255 papers that we have\nattempted to implement independently. We developed the first empirical quantification about independent\nreproducibility by recording features from each paper and reproduction outcome. We will review\nthe entire procedure and features obtained in section 2. In section 3 we will discuss which features\nwere determined to be statistically significant, and we will discuss the implications of these results.\nWe will discuss the deficiencies of our study in section 4, with subjective analysis in section 5, and\nthen conclude in section 6.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n2 Procedure and Features\n\nFor clarity, we will refer to ourselves, the author of this paper, as the reproducers, distinct from\nthe authors of the papers we attempt to independently reproduce. To perform our analysis, we\nobtained features from 255 papers. Inclusion criteria included papers that proposed at least one new\nalgorithm/method that is the subject of reproduction, and papers where the first implementation and\nreproduction attempts occurred from January 1st 2012 through December 31st 2017. We chose\nvaried paper topics based on our historical interest. No papers were included from 2018 to present, as\nsome papers take more time to reproduce than others, which could negatively skew results for papers\nfrom the past year. If the available source code for a paper under consideration was seen before\nhaving successfully reproduced the paper, we excluded the paper from this analysis because at that\npoint we are not a fully independent party. 
In line with this, any paper was excluded if the paper\u2019s\nauthors had any significant relationship with the reproducers (e.g., academic advisor, co-worker, close\nfriends, etc.) because intimate knowledge of communication style, work preferences, or the ability to\nhave more regular communication could bias results. A paper was considered to be reproduced if the\ncode for the results was written by the reproducers, allowing the use of reasonable and standard libraries\n(e.g., BLAS, PyTorch, etc.), and the code reproduced the majority of claims from the paper.\nSpecifically, we regarded a paper as reproducible if the majority (75%+) of the claims in the paper\ncould be confirmed with code we independently wrote. If a claimed improvement was measured in\norders-of-magnitude, being within the same order-of-magnitude was considered sufficient (e.g., a\npaper claims 700x faster, but reproducers observe 300x). This same order-of-magnitude criterion\ncomes from an observation that such claims are highly dependent upon constant-factor efficiency\nimprovements that may be present or missing in both the prior methods and the proposed method being\nreplicated. Presence or absence of these improvements can cause apparently \u201cdramatic\u201d impacts\nwithout fundamentally changing the nature of the contribution we are attempting to reproduce. When\ncompared to other algorithms, we consider a paper reproduced if the considerable majority (90%+)\nof the new algorithm\u2019s rankings correspond to those found in the paper (e.g., if the claim is that the\nproposed method was most accurate on 95% of tasks compared to 4 other models, we want to see our\nreproduction be most accurate on at least 95% \u00b7 90% = 85.5% of the same tasks, compared to the same\nmodels). 
As a last resort, we considered getting within 10% of the numbers reported in the paper (or\nbetter), or in the case of non-quantitative results (e.g., GAN sample quality), we subjectively compared\nour results with the paper to make a decision. We include this flexibility in specification to allow for\nsmall differences that can occur. While not common, we did encounter more than one instance where\nour independent reproduction achieved better results than the original paper.\nAfter this selection process, we are left with 255 papers, of which 162 (63.5%) were successfully\nreplicated and 93 were not. We note that this is significantly better than the 26% reproducibility\ndetermined by [7], who defined reproducibility as a function of the features they believed would\ndetermine reproduction. Below we will describe each of the features used. We attempt to catalog both\nfeatures that are believed relevant to a paper\u2019s reproduction and features that should not be relevant,\nwhich will help us quantify if these expectations hold. We will use statistical tests to determine\nwhich of these features have a significant relationship with reproduction. An anonymized version\nof the data can be found at https://github.com/EdwardRaff/Quantifying-Independently-Reproducible-ML.\n\n2.1 Quantified Features\n\nWe have manually recorded 26 attributes from each paper, which took approximately 20 minutes per\npaper to complete.1 A policy for each feature was developed to minimize as much subjectivity as\npossible. Below we will review each feature, and how they were recorded, in order from least to\nmost subjective.\n\n1Not done in a continuous run. Feature collection, paper selection, and total time preparing the study\ndata took approximately 6 months.
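Taken together, the criteria above form a small decision procedure. The following is a minimal sketch, not the authors' actual tooling; the function names and the encoding of each claim type are illustrative assumptions:

```python
import math

def speedup_ok(claimed, observed):
    """Order-of-magnitude claims: an observed speedup in the same (or a
    higher) order of magnitude passes, e.g. a claimed 700x with an
    observed 300x still counts (both are in the hundreds)."""
    return math.floor(math.log10(observed)) >= math.floor(math.log10(claimed))

def ranking_ok(claimed_frac, observed_frac):
    """Ranking claims: at least 90% of the claimed ranking rate must be
    matched, e.g. claimed best on 95% of tasks -> need 0.95 * 0.90 = 0.855."""
    return observed_frac >= 0.90 * claimed_frac

def number_ok(claimed, observed):
    """Last-resort criterion: within 10% of the reported number.
    (Whether "better" means higher or lower depends on the metric;
    this sketch only checks the 10% band.)"""
    return abs(observed - claimed) / abs(claimed) <= 0.10

def paper_reproduced(confirmed_claims):
    """A paper counts as reproduced if 75%+ of its claims were confirmed."""
    return sum(confirmed_claims) / len(confirmed_claims) >= 0.75
```

For example, a paper with four claims of which three were confirmed sits exactly at the 75% threshold and would count as reproduced under this sketch.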
Each feature was obtained based on the body of the main paper only, excluding any\nappendices (unless specified otherwise).\nFeatures to consider were selected based on two factors: 1) would one reasonably believe the feature\nshould be correlated with the ability to reproduce a paper (positive or negative) and 2) was the feature\nreasonably available with little additional work? This was done to capture as much useful information\nas possible while also avoiding limiting our study to items where a priori one might believe a\nfeature\u2019s relevance (or lack thereof) to be \u201cobvious.\u201d\nUnambiguous Features: Some features are not ambiguous in nature. A few are simple and innate\nproperties that require no explanation. This included the Number of Authors, the existence of an\nappendix (or supplementary material), the number of pages (including references, excluding any\nappendix), the number of references, the year the paper was published, the year first attempted to\nimplement, the venue type (Book, Journal, Conference, Workshop, Tech-Report), as well as the\nspecific publication venue (e.g., NeurIPS, ICML). Many papers follow a progression from Tech-Report\nto Workshop to Conference to Journal as the paper becomes more complete. For any paper\nthat participated in parts of this progression, we use the version from the most \u201ccomplete\u201d venue\nunder the assumption that it would be the most reproducible version of the paper, allowing us to avoid\nissues with double-counting papers.\nWe also include whether or not the Author Replied to questions about their paper. If any author\nreplied to any email, it was counted as a \u201cYes\u201d. 
If no author ever replied, we marked it as \u201cNo.\u201d In all\ncases, every paper author was sent an email before marking it as \u201cNo.\u201d If a current email could not be\nfound, we marked that the authors were not contacted.\nMild Subjectivity: We spend more time expounding on the next set of features, which had minor\ndegrees of subjectivity. We state below the developed procedure we used to make their quantification\npractical and reproducible.\n\u2022 Number of Tables: The total number of tables in the paper, regardless of the content of those tables.\nWhile tables usually contain results, they often contain a wide variety of content, and we make no\ndistinction between them due to their frequency and variety.\n\u2022 Number of Graphs/Plots: The total number of plots/graphs contained in the paper, which includes\nscatter plots, bar-charts, contour-plots, or any other kind of 2D-3D numeric data visualization.\n\u2022 Number of Equations: Due to differing writing styles, we do not use the equation numbers provided\nby the paper, nor do we count everything that might be typed between LaTeX \u201c$$\u201d brackets. We\nmanually reviewed every line of every paper to arrive at a consistent counting process.2 Inline\nmathematics were only counted if the math involved 1) two or more variables interacting (e.g.,\nx \u00b7 y) or 2) two or more \u201coperations\u201d (e.g., P (x|y) or O(x\u00b2)). If only one \u201coperation\u201d occurred\n(e.g., P (x) or x\u00b2), it was not considered. Inline equations were counted only once per line of text,\nregardless of how many equations occurred in a line of text. Whole-line equations were always\ncounted, regardless of the simplicity of the equation. If multiple whole lines were used because of\nequation length (e.g., a \u201c+\u201d), it was counted as one equation. If multiple whole lines were used\ndue to showing a mathematical step or derivation, each step counted as an additional equation.\nPartial deference was given to equation numbers. If every line of an equation received its own\nnumber, they were counted accordingly. If a derivation over n whole lines received only one\nequation number, the equation was counted \u2308n/3\u2309 times.\n\u2022 Number of Proofs: A proof was only counted if it was done in a formal manner, beginning with the\nstatement of a corollary or theorem, and included at least an overview of how to achieve the proof.\nA proof was counted if it occurred in the appendix or supplementary material. Derivations of\nupdate rules or other equations did not count as a proof unless the paper stated them as a proof. This\nwas done as a practical matter in reducing ambiguity and the process of collecting the information.\n\u2022 Exact Compute Specified: If a paper indicated any of the specific compute resources used (e.g.,\nCPU GHz speed or model number, GPU model, number of computers used), we considered it to\nhave satisfied this requirement.\n\u2022 Hyper-parameters Specified: If a paper specified the final hyper-parameter values selected for each\ndataset or the method of selecting hyper-parameters (e.g., cross validation factor) and the value\nrange (e.g., \u03bb \u2208 [1, 1000]), we consider it to have satisfied this requirement. Simply stating that\na grid-search (or similar procedure) was used was not sufficient. If a paper introduced multiple\nhyper-parameters but only specified how a sub-set of the parameters were chosen, we marked it\nas \u201cPartial\u201d.\n\u2022 Compute Needed: We defined the compute level needed to reproduce a paper\u2019s results as needing\neither a Desktop (i.e., \u2264 $2000), a consumer GPU (e.g., an Nvidia GeForce type card), a Server\n(used 20 cores or more, or 64 GB of RAM or more), or a Cluster. If the compute resources needed\nwere not explicitly stated, this was subjectively based on the computational complexity of the\napproach and amount of experiments believed necessary to reach reproduction. We stress that this\ncompute level was selected based on today\u2019s common compute resources, not those available at\nthe time of the paper\u2019s publication.\n\u2022 Data Available: If any of the datasets used in the paper are publicly available, we note it as having\nsatisfied this requirement.\n\u2022 Pseudo Code: We allow for four different options for this feature: 1) no pseudo code is given in\nthe paper, 2) \u201cStep-Code\u201d is given, where the paper outlines the algorithm/method as a sequence of\nsteps, but the steps are terse and high-level or refer to other parts of the paper for details, 3) \u201cYes\u201d,\nthe paper has some pseudo code which outlines the algorithm at a high level but with sufficient\ndetail that it feels mostly complete, and 4) \u201cCode-Like\u201d, the paper summarizes the approach in\ngreat detail that is reminiscent of reading code (or is in fact code).\n\n2Not all papers have LaTeX available, and older papers are often scanned, making automation difficult.\n\nSubjective: We have a final set of features which we recognize are of a significantly subjective nature.\nFor all of these features, we are aware there may be significant issues, and in practice, any 
alternative\nprotocol would impose its own different set of issues. We have made these choices in an attempt to\nminimize as many issues as possible and make the survey possible. Below is the protocol we followed\nto reduce ambiguity and make our procedure as reproducible as possible for future studies, which\nwill help the reader fully understand our interpretation of the results.\n\u2022 Number of Conceptualization Figures: Many papers include graphics or content for which the\npurpose is not to convey a result, but to try to convey the proposed idea/method itself. These\nare usually included to make it easier to understand the algorithm, and so we identify them as a\nseparate item to count.\n\u2022 Uses Exemplar Toy Problem: As a binary \u201cYes\u201d/\u201cNo\u201d option, did the paper include an exemplar\ntoy problem? These problems are not meaningful toward any application of the algorithm, but\nthey are devised to show specific behaviors or create demonstrations that are easier to reproduce or\nhelp conceptualize the algorithm being presented. These are often 2D or 3D problems, or they are\nsynthetically generated from some specified set of distributions.\n\u2022 Number of Other Figures: This was a catch-all class for any figure that was not a Graph/Plot, Table,\nor Conceptualization Figure as defined above. For most papers, this included samples of the output\nproduced by an algorithm or example input images for Computer Vision applications.\n\u2022 Rigor vs Empirical: There have been a number of calls for more scientific rigor within the ML\ncommunity [8], with many arguing that an overly empirical focus may in fact slow down progress\n[9]. We are not aware of any agreed-upon taxonomy of what makes a paper \u201crigorous\u201d. 
Based on\nthe interpretation that rigor equates to having a grounded understanding of why and how our methods\nwork, beyond simply showing that they do so empirically, we develop the following protocol: a\npaper is classified as \u201cTheory\u201d (read: rigorous) if it has formal proofs, provides mathematical\nreasoning or explanation for modeling decisions, or provides mathematical reasoning or explanation\nfor why prior methods fail on some dataset. By default, we classify all other papers as \u201cEmpirical.\u201d\nHowever, if a \u201cTheory\u201d paper also includes discussion of practical implementation or deployment\nconcerns, complete discussion of hyper-parameter settings such that there is no ambiguity, ablation\nstudies of decisions made, or experiments on production datasets, we consider the paper \u201cBalanced\u201d\nas having both theory and empirical components.\n\u2022 Paper Readability: We give each paper a readability score of \u201cLow\u201d, \u201cOk\u201d, \u201cGood\u201d, or \u201cExcellent.\u201d\nTo minimize subjectivity in these scores, we tie each to the number of times we had to read the\npaper in order to reach a point where we felt we had the proposed algorithm implemented in its\nentirety, and the failure to replicate would be a matter of finding and removing bugs. The score
The score\nof \u201cExcellent\u201d means that we needed to read the paper only once to produce an implementation,\n\u201cGood\u201d papers needed two or three readings, \u201cOk\u201d papers needed four or \ufb01ve, with \u201cLow\u201d being\nsix or more reads through the paper3.\n\u2022 Algorithm Dif\ufb01culty: We categorize the dif\ufb01culty of implementing an algorithm as either \u201cLow\u201d,\n\u201cMedium\u201d, or \u201cHigh.\u201d We grounded this to lines of code for any paper successfully implemented\n\n3This information was obtained from our own record keeping over time and paper-organizing software\n\n4\n\n\for which made its implementation available online. For ones never successfully implemented and\nwithout code, we estimated this based on our intuition and experience on where the implementation\nwould have landed based on reading the paper. \u201cLow\u201d dif\ufb01culties could be completed in 500 lines\nof code or less, \u201cMedium\u201d dif\ufb01culty between 500 and 1,500 lines, and \u201cHigh\u201d was > 1,500 lines.\nIn these numbers we assume using common libraries (e.g., auto-differentiation, BLAS, etc.).\n\u2022 Primary Topic: For each paper we tried to specify a single primary topic of the paper. Many\npapers cover different aspects of multiple problems, making this a challenge. We adjusted topics\ninto higher-level categories so that each topic had at least three members, so that we could do\nmeaningful statistics. Topics can be found in the appendix.\n\n\u2022 Looks Intimidating: The most subjective, does the paper \u201clook intimidating\u201d at \ufb01rst glance?\n\n3 Results\n\nTable 1: Signi\ufb01cance test of which paper\nproperties impact reproducibility. Results sig-\nni\ufb01cant at \u03b1 \u2264 0.05 marked with\u201c*\u201d.\n\nOur features are either numeric or categorical. For\neach numeric feature (except the number of pages\nand number of authors), we normalized the value by\nthe number of pages in the paper. 
Longer papers naturally have more space to include more equations,\nfigures, etc., and this was done to make all papers\nmore directly comparable. For numeric features we\nused the non-parametric Mann\u2013Whitney U test [10]\nto determine significance. A Shapiro-Wilk test of normality [11] confirmed that none of our features would\nhave been appropriate for use with a Student\u2019s t-test,\nand so the non-parametric testing is preferred. For all\ncategorical features, we used a Chi-Squared test [12]\nwith continuity correction [13]. In our analysis we\nwill also examine relationships between some of our\ncategorical features and other numeric features for\nsuspected relationships. We will continue to use non-parametric tests for robustness/conservative estimates\nof significance, relying on the Kruskal-Wallis test [14] for\nANOVA testing and the Dunn test [15] for post-hoc\nanalysis. JASP was used to compute all statistical\ntests [16]. In Table 1 we show the results for deciding which of our 26 features were correlated with a\npaper\u2019s reproducibility. Tables and graphs of all the\nfeatures are too numerous to fit in the main paper, and\ncan be found in the appendix.\nWe begin by noting that the year a paper was published or the year that we first tried to implement the\npaper were not correlated with successful reproduction. The concerns of a reproducibility crisis would\ngenerally imply that the issue is a recent one. However, the year a paper was published is not correlated\nwith successful reproduction, with the oldest paper being from 1984. This would suggest that independent\nreproducibility has not changed over time. Depending on one\u2019s perspective, we could argue that\nthere is not a reproducibility crisis, or that one has been ongoing for several decades. 
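The numeric-feature testing described above can be illustrated in code. This sketch is not the study's analysis (the authors used JASP); it is a minimal pure-Python Mann-Whitney U with a normal-approximation p-value and no tie correction (a real analysis would use something like `scipy.stats.mannwhitneyu`), applied to hypothetical per-page-normalized counts:

```python
import math

def mann_whitney_u(x, y):
    """U statistic for x versus y: count pairs (a, b) with a > b,
    ties counting one half. U(x, y) + U(y, x) == len(x) * len(y)."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in x for b in y)

def mann_whitney_p(x, y):
    """Two-sided p-value via the normal approximation (no tie
    correction, so only indicative for small or tied samples)."""
    n, m = len(x), len(y)
    u = mann_whitney_u(x, y)
    mean = n * m / 2.0
    sd = math.sqrt(n * m * (n + m + 1) / 12.0)
    z = abs(u - mean) / sd
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

# Hypothetical (equations, pages) pairs, normalized per page as in the
# study, for reproduced vs. non-reproduced papers.
reproduced = [eq / pg for eq, pg in [(18, 8), (10, 9), (25, 12)]]
not_reproduced = [eq / pg for eq, pg in [(40, 8), (52, 10), (33, 7)]]
u = mann_whitney_u(reproduced, not_reproduced)  # 0.0: complete separation
```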
It is important\nthe reader qualify this statistical result with the fact that the year of paper publication in our study is\nnot evenly distributed over time, with the majority of papers occurring between 2000 and 2017.\nTo our study\u2019s benefit, the year first attempted for reproduction was not significant. If our success\nwas correlated with time (as one might expect in advance, with skill increasing with experience),\nwe would worry about this skewing our results. This appears to not be an issue, removing a potential\nproblem from our results.\n\nFeature                          p-value\nYear Published                   0.964\nYear First Attempted             0.674\nVenue Type                       0.631\nRigor vs Empirical*              1.55 \u00d7 10^\u22129\nHas Appendix                     0.330\nLooks Intimidating               0.829\nReadability*                     9.68 \u00d7 10^\u221225\nAlgorithm Difficulty*            2.94 \u00d7 10^\u22125\nPseudo Code*                     2.31 \u00d7 10^\u22124\nPrimary Topic*                   7.039 \u00d7 10^\u22124\nExemplar Problem                 0.720\nCompute Specified                0.257\nHyperparameters Specified*       8.45 \u00d7 10^\u22126\nCompute Needed*                  8.75 \u00d7 10^\u22125\nAuthors Reply*                   6.01 \u00d7 10^\u22128\nCode Available                   0.213\nPages                            0.364\nPublication Venue                0.342\nNumber of References             0.740\nNumber Equations*                0.004\nNumber Proofs                    0.130\nNumber Tables*                   0.010\nNumber Graphs/Plots              0.139\nNumber Other Figures             0.217\nConceptualization Figures        0.365\nNumber of Authors                0.497\n\n3.1 Significant Relationships\n\nThere were ten variables that are significantly correlated with a paper\u2019s reproducibility. Of them,\nNumber of Tables, Equations, Compute Needed, Pseudo-Code, and Hyper-parameters Specified are\nthe least subjective variables which were significant.\nReadability had the strongest empirical relationship, which on its face is not surprising. Note that\nby our definition, Readability corresponds to how many reads through the paper were necessary to\nget to a mostly complete implementation. 
As expected, the fewer attempts to read through a paper,\nthe more likely it was to be reproduced. For \u201cExcellent\u201d papers, we were always able to reproduce\nresults. Exact counts can be found in Table 11. Based on these results we argue the importance of\nclear and effective communication of implementation details, which may often be neglected. This\nneglect may come from forced page limits or a preference towards other, competing factors (e.g.,\npreferring figures/results that better show the method\u2019s value, at the cost of method details). We\nsuspect that a factor in this is page limits, which we test by proxy via paper page length. A Kruskal-Wallis\ntest confirms the significance of paper length in pages (p = 0.035). A Dunn post-hoc test\nshows that \u201cLow\u201d Readability papers are the statistically significant source of this relationship, which\nare 3.17\u20135.67 pages shorter than the other Readability types. As a field that has historically focused\non open-access and online availability, and with the decreasing relevance of printed conference and\njournal distributions, our study suggests that raising page limits on papers, and adding technical\nalgorithmic details as an explicit review factor, could aid in increasing the reproducibility of papers.\n\nTable 2: Relationship between use of Pseudo Code and Readability\n\nTwo factors we expected to be related to a paper\u2019s Readability (as we have defined it) are\nthe Algorithm\u2019s Difficulty and the presence of Pseudo-Code, both of which are significant\nfactors and have a statistically significant relationship with Readability. If we look at Table 12,\nwe see that Pseudo-Code has a complicated relationship with Reproduction.\nHighly Detailed \u201cCode-Like\u201d descriptions are more reproducible, but having \u201cNo\u201d pseudo-code is\nalso positively related with reproduction. 
Based on these results, papers which can effectively describe\ntheir algorithms without pseudo-code are communicating the information in another way, but papers\nwith \u201cStep-Code\u201d do an inadequate job at this task. Examining the relationship between Pseudo-Code\nand Readability in Table 2 supports this, where we see that using Step-Code is biased toward\nlower readability. This also makes sense in the abstract, as step-code often requires one to repeatedly\nreference different parts of a paper. The relationship with an Algorithm\u2019s Difficulty is more direct\nand intuitive, with Table 10 showing that reproducibility decreases with difficulty.\nIt is also interesting to note how Rigor vs Empirical is correlated with reproducibility. One may have\nexpected papers that focus on proving their methods correct would be the most reproducible. In\nTable 8 we can see that papers that are \u201cEmpirical\u201d or \u201cBalanced\u201d both have higher than expected\nreproduction rates, while \u201cTheory\u201d oriented papers have lower than expected. These results would\nseem to suggest that empiricism is intrinsically valuable for reproduction on the micro scale of\nindividual papers. This does not contradict any of the concerns about long-term behaviors and results\nthat are side effects of overly-empirical issues discussed by Sculley et al. [9], such as new methods\nbeing inappropriately considered due to ineffectively tuned baselines and lack of ablation studies. We\ntake this result as a further indication that rigor cannot just be math or learning bounds for their own\nsake, but that the practical relevance and execution of any theorems must be at the forefront in all\npapers.4\nUnfortunately, the primary topic of a paper was found to be a significant factor for independent\nreproducibility. We were not able to reproduce any Bayesian or Fairness-based papers. 
We had a\nhigher than expected success in implementing papers about Deep Learning and Search/Retrieval.\nWe, the reproducers, are not experts in all of the primary topic areas listed, and so we advise against\nextrapolation from this particular result. This leads to interesting questions regarding reproduction\nfrom inside/outside an expert peer group and when one qualifies as an expert in a general topic area.\nWe hope to explore these questions further in future work.\n\n                      Paper Readability\nPseudo Code          Low      Ok    Good  Excellent\nNo        Actual   22.00   24.00   10.00   23.00\n          Expected 23.24   13.94   17.66   24.16\nStep-Code Actual   29.00    6.00   15.00    7.00\n          Expected 16.76   10.06   12.74   17.44\nYes       Actual   21.00   14.00   28.00   39.00\n          Expected 30.00   18.00   22.80   31.20\nCode-Like Actual    3.00    1.00    4.00    9.00\n          Expected  5.00    3.00    3.80    5.20\n\n4This would not apply for pure theory papers, and we remind the reader that all papers in this study proposed\nand evaluated some new algorithm, and thus do not fall into a pure theory category.\n\nBoth Number of Tables and Hyper-parameters were positively correlated with reproducibility. The\nmore tables included in a paper, or the more parameters specified, the more likely the paper was to be\nreproducible. This is not a surprising result for the Hyper-parameters case, and supports the emphasis\nthe community has placed on this factor [5]. It is somewhat peculiar that Tables are significant, but\nGraphs/Plots are not, as both convey primarily numeric information to the reader. We suspect that the\nability for the reader to quickly understand the exact value/result from a table is the differentiating\nfactor, as it gives a target to meet and measure against. While a plot/graph may describe overall\nbehavior, it may not readily avail itself to quickly extracting a hard number and using it as a goal.\nThe Number of Equations per page was negatively correlated with reproduction. 
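The "Actual" versus "Expected" entries in contingency tables such as Table 2 follow the standard independence model: each expected cell is (row total × column total) / N, and the Chi-Squared statistic sums (O − E)²/E over cells. A minimal sketch (illustrative; the study's tests were computed in JASP), using the Actual counts from Table 2:

```python
def expected_counts(observed):
    """Expected cell counts under row/column independence:
    E[i][j] = row_total[i] * col_total[j] / grand_total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = float(sum(row_totals))
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi2_statistic(observed):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over cells.
    (The study also applied a continuity correction, omitted here.)"""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

# Actual counts from Table 2: rows are Pseudo Code {No, Step-Code, Yes,
# Code-Like}; columns are Readability {Low, Ok, Good, Excellent}.
table2 = [[22, 24, 10, 23],
          [29,  6, 15,  7],
          [21, 14, 28, 39],
          [ 3,  1,  4,  9]]
expected = expected_counts(table2)  # e.g. expected[0][0] is about 23.24
```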
Two theories as to\nwhy were developed based on our experience implementing the papers: 1) having a larger number\nof equations makes the paper more difficult to read, hence more difficult to reproduce, or 2) papers\nwith more equations correspond to more complex and difficult algorithms, naturally being more\ndifficult to reproduce. A Kruskal-Wallis ANOVA reveals that the readability hypothesis is significant\n(p = 0.001) but not the difficulty hypothesis (p = 0.239). A follow-up Dunn post-hoc test\nshows that papers which have \u201cExcellent\u201d readability have fewer equations per page (2.25 eq/pg)\nthan the others, as the source of the significant (p \u2264 0.002) relationship. There are no significant\ndifferences between papers of \u201cLow,\u201d \u201cOk,\u201d and \u201cGood\u201d readability (3.91, 3.60, and 3.78 equations per\npage, respectively), leading us to postulate that the most readable and reproducible papers make\ncareful and judicious use of equations.\nOur last paper-intrinsic property is Compute Needed, which could be a \u201cDesktop\u201d, \u201cGPU\u201d, \u201cServer\u201d,\nor \u201cCluster\u201d. In the time that these papers were implemented, we have had access to all four compute\nlevels to varying degrees. Looking at Table 18, we see the use of a Cluster or GPU are the ones that\ndepart from expectations. Despite having access to cluster resources, we have never successfully\nreproduced a paper that needed such resources. At the same time, we have a higher reproduction rate\nfor works that require a GPU. Our suspicion is that frameworks such as PyTorch and TensorFlow,\nwhich make use of GPUs relatively easy, have been converging toward an effective paradigm for using\nthat kind of resource. These libraries make it easier to reproduce current papers and historical ones that\nlacked such advanced tools, which then inflates the reproduction rate. 
While frameworks like Spark exist for distributed computation, they may not be sufficiently developed for Machine Learning use cases to ease replication. An alternative hypothesis for the Cluster reproduction failures is that the details of how a cluster is organized, with interconnects, job scheduling, and more sophisticated code, raise the reproduction barrier, and that papers lack the necessary details. We do not have sufficient information to confirm or reject these hypotheses, but we encourage others to consider them as avenues for study.
This leaves us with the last significant result, which is not a property of the paper itself: whether the paper's authors reply to questions about their paper. We reached out to the authors of 50 different papers and had a reply rate of 52%. Table 7 shows that replying was the most individually predictive attribute studied. In the 24 cases where the authors did not respond to questions, we succeeded in replication only once. In the 26 cases where they did reply, we succeeded 22 times. While this result demonstrates the importance of corresponding with readers, it also gives credence to the idea of a non-stationary, “living” paper, where updates may be made over time to address questions and concerns. This is possible today with arxiv.org and distill.pub, and our result provides quantifiable evidence that the ability to update articles is a meaningful and powerful tool toward reproducibility (if leveraged). Other confounding hypotheses exist as well: receiving a reply may increase the motivation of the reproducers, and a discussion not constrained by the limits of a paper may itself impact reproduction rates.

3.2 Interesting Non-Significant/Negative Results

While we have already discussed some non-significant results as they relate directly to significant ones above, we also want to highlight interesting non-significant results.
In particular, we expected a priori that the use of Conceptualization Figures and Exemplar Problems would be significant predictors, as we have found them useful in our personal experience both for understanding an algorithm and as an initial test-bed to confirm an algorithm is working to a minimal degree. Yet neither is significant. We also find that neither has a relationship with a paper's Readability (p ≥ 0.476). These results give us pause regarding our assumptions about what makes a “good” reproducible paper, and reinforce the importance of quantifying these important questions.
A positive indicator is that Venue (e.g., NeurIPS vs PKDD) had no significant impact, nor did Venue type (e.g., Workshop vs Journal). This result would seem to imply that the same issues and successes occur across most academic levels, though selection bias may play a role in this result.
The non-significance of including an appendix is of note, given our result that the papers which are hardest to reproduce (“Low” Readability) are shorter on average. There is no significant relationship between a paper's readability and the presence of an appendix (p = 0.650), which implies that appendices are not a sufficient means of circumventing page limits at conference/workshop venues.
We found it interesting that whether or not a paper's authors released their code has no significant relationship with the paper's independent reproducibility. Before analysis, we could see hypotheticals that would cause correlations in either direction. Authors who release code might include fewer details in the paper under the assumption that readers will find them in the code itself. Conversely, one might imagine that authors who release code care more about reproduction and would include more of the necessary details.
With more conferences encouraging code availability as a reviewing criterion, we would not necessarily expect any change in independent reproducibility from this change in isolation (any cultural changes it induces being a question beyond our scope).

4 Study Deficiencies

While we have taken the first step toward studying and quantifying factors of reproducibility, we must also acknowledge deficiencies in our study. Most apparent are a number of potential biases. The papers under consideration have a selection bias, based on interest and on filtering from consideration any paper where we had previously looked at released source code. More importantly, all papers were attempted by just this paper's author. So while we have a large sample size of papers, we have a low sample size of implementers. It is entirely possible that those with a different background in education, training, career, and interests would find different papers easy or difficult to reproduce. Because we are the sole reproducer, all the results must be interpreted as conditioned on our background, and on the origin of this work. A majority of attempted reproductions were in pursuit of contributions to a machine learning library that we are the author of, JSAT [17]. As such, we focused initially on a number of more common and widely used algorithms. These methods had already been independently reproduced by others many times, and alternative materials (e.g., lecture notes) were available to provide guidance without consulting code written by others. Further papers were spurred by what we considered useful for such a library, and by our own personal interests (historically, nearest neighbor algorithms, linear models, and kernel methods).
Such well known works do not make up the majority of reproduction attempts, but they make up a sizable sub-population of the methods we attempted for JSAT, and so may skew results.
Our study is also limited by our own historical records. The use of paper-cataloging software to take notes and record information made this study possible, but it also limits our study to the recorded notes and what can be re-derived from the paper itself (e.g., number of pages).
Toward improving upon the number of implementers and the recorded information, we hope to encourage extensions to projects such as the ICLR Reproducibility Challenge. A communal effort to standardize on an initial set of paper features, keep track of the time and resources spent on reproduction, and record information about the reproducers (years of experience, education, and background) may allow for a richer and more thorough macro study of reproducibility in the future. A design constraint we would like to include in such a system is differential privacy, so that it is not known which individual papers are having reproduction difficulties. We have intentionally avoided identifying papers to avoid any perceived “naming and shaming”, as our or others' attempts in isolation should not be seen as conclusive statements on any individual paper's lack of reproduction.
In our experience attempting to reproduce these papers, we also note a failure in the framing of the problem: that a paper is reproducible or not. Depending on the paper, differing levels of resources, and even teams, may be necessary for reproduction. As a point of reference, the longest effort toward reproduction we studied took 4.5 years of (non-continuous) effort to finally reproduce the results. In this light, it may be better to model reproduction as a kind of survival analysis conditioned on properties of the implementer(s).
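This survival framing can be sketched with a Kaplan-Meier estimator, treating time-to-reproduction as the survival time and abandoned (never-reproduced) attempts as right-censored observations. The attempt data below are hypothetical illustrations, not measurements from this study:

```python
# Kaplan-Meier estimate of the probability a paper remains unreproduced
# ("survives") past each observed reproduction time.
# Each entry: (months of effort, reproduced?). reproduced=False means the
# attempt was abandoned, i.e. a censored observation. Numbers are hypothetical.
attempts = [(1, True), (2, True), (3, False), (4, True),
            (6, True), (9, False), (12, True), (54, True)]

def kaplan_meier(observations):
    """Return [(time, S(time))] at each event (successful reproduction) time."""
    observations = sorted(observations)
    n_at_risk = len(observations)
    surv, curve = 1.0, []
    for time, reproduced in observations:
        if reproduced:                # an "event": the paper was reproduced
            surv *= 1 - 1 / n_at_risk
            curve.append((time, surv))
        n_at_risk -= 1                # censored attempts also leave the risk set
    return curve

curve = kaplan_meier(attempts)
```

Conditioning such curves on implementer properties (the survival-analysis analogue of covariates) is the kind of richer model a communal reproduction registry would enable.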
A paper “survives” as the implementers attempt reproduction and “dies” once successfully reproduced (or “lives” forever if never reproduced). Viewed in this light, we may ask: what environmental factors (e.g., libraries like PyTorch and Scikit-Learn, or compute resources) impact survival rates and times, and should the necessity of code release be a function of survival time? A real-life example of this is playing out now, as people attempt to reproduce OpenAI's recent GPT-2 results5, where information and data were intentionally withheld due to security concerns.
An important factor not included in our analysis is the authors of a paper, which has a direct impact on writing style, topic, and other factors. Subjectively, we note that there are authors whose work we regularly fail to reproduce and others whose work we regularly succeed in reproducing, even when both make code available. Studying how the backgrounds and styles of both authors and implementers interact and impact reproduction seems a valuable line of inquiry, but it is beyond our current scope and requires additional thought and consideration.
We also note that the most significant factors in reproducibility are the most subjective ones. While we endeavored to reduce the impact of subjectivity as much as possible with our stated protocols, this indicates that more work is warranted, for future studies, in developing more objective measures related to these subjective factors, or in using communal effort to reach a distributional determination of them.

5 A Subjective Recall of Non-Reproduction

We did not record the believed reason for each failure to reproduce, although this would have been valuable information. We hope that this will be noted by others in the future, but for now we recount a subjective summary of the primary reasons we felt a paper could not be reproduced.
We note that part of our belief in the list below stems from our efforts to email papers' authors when attempting to independently reproduce their works, in which we were often seeking information that would elucidate these issues:

1. Unclear notation or language. A component of the algorithm is explained, but not in a way easily understood by the reproducers, or is ambiguously specified.
2. Missing algorithm step or details. A step was completely left out of the description.
3. Missing gradients. Many papers specify loss functions or other equations for which a gradient must be taken, but do not detail the resulting gradients. Depending on the functions and math involved, re-deriving them was non-trivial, and our results did not match.
4. Missing hyper-parameters, or similarly nuanced details. The reproducers believe we have an implementation accurate to what was described, but some “minor” detail was not specified and makes a big difference in results.

We avoided in this paper any attempt to imply or cast doubt on the veracity of any individual paper. In our experience through this work, we have rarely had suspicion that the results of a paper were false, or the result of a seriously flawed implementation, and thus could never be reproduced.

6 Conclusions

In this work we have conducted the first empirical study of what impacts a paper's reproducibility. We suspect this will lead to considerable debate about the meaning of the results, and we hope to spur further quantifiable studies. Based on our results, we find that paper reproduction rates have not changed (in a statistically significant way) over the past 35 years. Papers of a more empirical nature tend to be more reproducible, as are ones that include factors relevant to implementation details, though simply including Pseudo-Code is not sufficient.
Our study indicates papers with fewer equations and more tables tend to be more reproducible, and that there is a potential latent issue in reproduction when cluster computing becomes a requirement.

Acknowledgements

I would like to thank Jared Sylvester, Arash Rahnama, Charles Nicholas, Cynthia Matuszek, Frank Ferraro, Ian Soboroff, and Ashley Klein, who all provided valuable discussion and feedback on this work through its formation to completion.

5 https://openai.com/blog/better-language-models/

References

[1] M. Hutson, “Artificial intelligence faces reproducibility crisis,” Science, vol. 359, no. 6377, pp. 725–726, 2018. [Online]. Available: https://science.sciencemag.org/content/359/6377/725
[2] R. Dror, G. Baumer, M. Bogomolov, and R. Reichart, “Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 471–486, 12 2017. [Online]. Available: https://www.aclweb.org/anthology/Q17-1033
[3] R. Tatman, J. Vanderplas, and S. Dane, “A Practical Taxonomy of Reproducibility for Machine Learning Research,” in Reproducibility in ML Workshop, ICML'18, 2018.
[4] G. C. Publio, D. Esteves, and H. Zafar, “ML-Schema: Exposing the Semantics of Machine Learning with Schemas and Ontologies,” in Reproducibility in ML Workshop, ICML'18, 2018.
[5] J. Forde, T. Head, C. Holdgraf, Y. Panda, F. Perez, G. Nalvarte, B. Ragan-Kelley, and E. Sundell, “Reproducible Research Environments with repo2docker,” in Reproducibility in ML Workshop, ICML'18, 2018.
[6] C. Drummond, “Replicability is not reproducibility: nor is it good science,” in Proceedings of the Evaluation Methods for Machine Learning Workshop at the 26th ICML, Montreal, Canada, 2009.
[7] O. E. Gundersen and S. Kjensmo, “State of the Art: Reproducibility in Artificial Intelligence,” in Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI-18), pp. 1644–1651, 2018.
[8] A. Rahimi and B. Recht, “NIPS 2017 Test-of-time award presentation,” 2017.
[9] D. Sculley, J. Snoek, A. Rahimi, and A. Wiltschko, “Winner's Curse? On Pace, Progress, and Empirical Rigor,” in ICLR Workshop track, 2018. [Online]. Available: https://openreview.net/pdf?id=rJWF0Fywf
[10] H. B. Mann and D. R. Whitney, “On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other,” The Annals of Mathematical Statistics, vol. 18, no. 1, pp. 50–60, 3 1947. [Online]. Available: http://projecteuclid.org/euclid.aoms/1177730491
[11] S. S. Shapiro and M. B. Wilk, “An analysis of variance test for normality (complete samples),” Biometrika, vol. 52, no. 3-4, pp. 591–611, 12 1965. [Online]. Available: https://academic.oup.com/biomet/article-lookup/doi/10.1093/biomet/52.3-4.591
[12] K. Pearson, “On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling,” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 50, no. 302, pp. 157–175, 7 1900. [Online]. Available: https://www.tandfonline.com/doi/full/10.1080/14786440009463897
[13] F. Yates, “Contingency Tables Involving Small Numbers and the χ² Test,” Supplement to the Journal of the Royal Statistical Society, vol. 1, no. 2, pp. 217–235, 1934. [Online]. Available: http://www.jstor.org/stable/2983604
[14] W. H. Kruskal and W. A.
Wallis, “Use of Ranks in One-Criterion Variance Analysis,” Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 12 1952. [Online]. Available: https://www.tandfonline.com/doi/abs/10.1080/01621459.1952.10483441
[15] O. J. Dunn, “Multiple Comparisons Among Means,” Journal of the American Statistical Association, vol. 56, no. 293, pp. 52–64, 1961. [Online]. Available: http://www.jstor.org/stable/2282330
[16] JASP Team, “JASP (Version 0.9) [Computer software],” 2018. [Online]. Available: https://jasp-stats.org/
[17] E. Raff, “JSAT: Java Statistical Analysis Tool, a Library for Machine Learning,” Journal of Machine Learning Research, vol. 18, no. 23, pp. 1–5, 2017. [Online]. Available: http://jmlr.org/papers/v18/16-131.html