The disconnect between classical biostatistics and the biological data mining community
1 Center for Information Technology, The National Institutes of Health, Bethesda, MD, United States of America
2 Departments of Genetics and Community and Family Medicine, Institute for Quantitative Biomedical Sciences, The Geisel School of Medicine, Dartmouth College, One Medical Center Dr, Lebanon, NH, 03756, United States of America
BioData Mining 2013, 6:12. doi:10.1186/1756-0381-6-12. Published: 24 July 2013
Statistics departments and journals still emphasize a very narrow range of topics and methods, driven by a small handful of results, many dating from the 1930s. Those methods were well suited to the computing power, known mathematical facts, and data of their day; hence the familiar list of assumptions: normal distributions, small parametric models, linearity, and independent features. The usual claims for these anchoring assumptions are accurate when the assumptions hold precisely, but more often they are simply irrelevant: data are rarely normal, model misspecification is always at work, features are highly entangled through functionally mysterious interactions, and multiple scientifically plausible models may all fit the data equally well.
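The claim that data are rarely normal is easy to demonstrate. As a minimal sketch (assuming NumPy and SciPy are available, and using simulated log-normal values as a stand-in for skewed biological measurements such as expression levels), a standard normality test soundly rejects the raw values while the log-transformed values look normal:

```python
# Minimal illustration: skewed "biological" data fail a normality test
# on the raw scale. The log-normal model here is an assumption chosen
# for illustration, not taken from the article.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated measurements: log-normal, i.e. strongly skewed on the raw scale.
raw = rng.lognormal(mean=2.0, sigma=0.8, size=500)

# Shapiro-Wilk test: a small p-value rejects the hypothesis of normality.
_, p_raw = stats.shapiro(raw)
_, p_log = stats.shapiro(np.log(raw))

print(f"p (raw values): {p_raw:.3g}")  # tiny: normality clearly rejected
print(f"p (log scale):  {p_log:.3g}")  # much larger after the transform
```

A workflow built on the normality assumption would treat these raw values as a mild nuisance at best; the test shows they are nowhere near normal until a transformation is chosen, which is itself a modeling decision.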