EBookClubs

Read Books & Download eBooks Full Online


Book Penalized Regressions for Variable Selection Model, Single Index Model and an Analysis of Mass Spectrometry Data

Download or read book Penalized Regressions for Variable Selection Model, Single Index Model and an Analysis of Mass Spectrometry Data written by Yubing Wan and published by . This book was released on 2014 with total page 84 pages. Available in PDF, EPUB and Kindle. Book excerpt: The focus of this dissertation is to develop statistical methods, under the framework of penalized regressions, to handle three different problems. The first research topic addresses the missing data problem for variable selection models, including the elastic net (ENet) method and sparse partial least squares (SPLS). I proposed a multiple imputation (MI) based weighted ENet (MI-WENet) method that stacks the MI data sets and assigns a weight to each observation. Numerical simulations were implemented to examine the performance of the MI-WENet method and compare it with competing alternatives. I then applied the MI-WENet method to examine the predictors of endothelial function, characterized by the median effective dose and maximum effect in an ex-vivo experiment. The second topic is to develop monotonic single-index models for assessing drug interactions. In single-index models, the link function f is not necessarily monotonic. However, in combination drug studies, a monotonic link function f is desired. I proposed to estimate f using penalized splines with an I-spline basis. An algorithm for estimating f and the parameter a in the index was developed. Simulation studies were conducted to examine the performance of the proposed models in terms of accuracy in estimating f and a. Moreover, I applied the proposed method to examine the interaction of two drugs in a real case study. The third topic focuses on SPLS- and ENet-based accelerated failure time (AFT) models for predicting patient survival time with mass spectrometry (MS) data.
A typical MS data set contains a limited number of spectra, while each spectrum contains tens of thousands of intensity measurements representing an unknown number of peptide peaks as the key features of interest. Due to the high dimension and high correlations among features, traditional linear regression modeling is not applicable. A semi-parametric AFT model with an unspecified error distribution is a well-accepted approach in survival analysis. To reduce the bias introduced in the denoising step, we proposed a nonparametric imputation approach based on the Kaplan-Meier estimator. Numerical simulations and a real case study were conducted to evaluate the proposed method.
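
The MI-WENet construction above rests on two simple ingredients: stacking the multiply imputed data sets and attaching a weight to every stacked observation. The sketch below illustrates only the stacking step, using naive hot-deck draws and uniform 1/m weights; the dissertation's actual imputation model and weighting scheme are not reproduced here, and the function name is invented for the example.

```python
import random

def stack_multiple_imputations(x, m=5, rng=None):
    """Create m completed copies of x (None marks a missing value) by
    drawing each missing entry from the observed values, then stack the
    copies. Each stacked row gets weight 1/m so that one original record
    still contributes total weight 1 across the stack."""
    rng = rng or random.Random(0)
    observed = [v for v in x if v is not None]
    stacked, weights = [], []
    for _ in range(m):
        completed = [v if v is not None else rng.choice(observed) for v in x]
        stacked.extend(completed)
        weights.extend([1 / m] * len(x))
    return stacked, weights
```

A weighted penalized regression fit on `(stacked, weights)` then plays the role the ENet fit plays on complete data.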

Book Variable Selection Via Penalized Regression and the Genetic Algorithm Using Information Complexity with Applications for High-dimensional Omics Data

Download or read book Variable Selection Via Penalized Regression and the Genetic Algorithm Using Information Complexity with Applications for High-dimensional Omics Data written by Tyler J. Massaro and published by . This book was released on 2016 with total page 360 pages. Available in PDF, EPUB and Kindle. Book excerpt: This dissertation is a collection of examples, algorithms, and techniques for researchers interested in selecting influential variables from statistical regression models. Chapters 1, 2, and 3 provide background information that will be used throughout the remaining chapters, on topics including but not limited to information complexity, model selection, covariance estimation, stepwise variable selection, penalized regression, and especially the genetic algorithm (GA) approach to variable subsetting. In chapter 4, we fully develop the framework for performing GA subset selection in logistic regression models. We present the advantages of this approach over stepwise selection and elastic net regularized regression in selecting variables from a classical set of ICU data. We further compare these results to an entirely new procedure for variable selection developed explicitly for this dissertation, called the post hoc adjustment of measured effects (PHAME). In chapter 5, we reproduce many of the same results from chapter 4 for the first time in a multinomial logistic regression setting. The utility and convenience of the PHAME procedure is demonstrated on a set of cancer genomic data. Chapter 6 marks a departure from supervised learning problems as we shift our focus to unsupervised problems involving mixture distributions of count data from epidemiologic fields. We start off by reintroducing Minimum Hellinger Distance estimation alongside model selection techniques as a worthy alternative to the EM algorithm for generating mixtures of Poisson distributions. We also create for the first time a GA that derives mixtures of negative binomial distributions.
The work from chapter 6 is incorporated into chapters 7 and 8, where we conclude the dissertation with a novel analysis of mixtures of count data regression models. We provide algorithms based on single- and multi-target genetic algorithms which fit mixtures of penalized count data regression models, and we demonstrate the usefulness of this technique on HIV count data that were used in a previous study published by Gray, Massaro et al. (2015), as well as on time-to-event data taken from the cancer genomic data sets from earlier.
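
GA subset selection of the kind developed in chapters 4 and 5 can be sketched compactly: encode each candidate subset as a bit string, score it with a penalized fit criterion, and evolve a population by selection, crossover and mutation. The toy version below uses AIC from ordinary least squares as the fitness, a stand-in for the information complexity (ICOMP) criteria the dissertation actually develops; all function names, data and settings are invented.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination (A small and well-conditioned)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * p for a, p in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def aic(X, y, subset):
    """AIC of the OLS fit on the chosen columns (lower is better)."""
    n = len(y)
    if not subset:
        return n * math.log(sum(v * v for v in y) / n)
    Xs = [[row[j] for j in subset] for row in X]
    k = len(subset)
    A = [[sum(r[a] * r[b] for r in Xs) for b in range(k)] for a in range(k)]
    b = [sum(r[a] * yi for r, yi in zip(Xs, y)) for a in range(k)]
    beta = solve(A, b)
    rss = sum((yi - sum(c * v for c, v in zip(beta, r))) ** 2
              for r, yi in zip(Xs, y))
    return n * math.log(rss / n) + 2 * k

def ga_select(X, y, p, pop=30, gens=40):
    """Evolve bit strings; each bit switches one covariate in or out."""
    rng = random.Random(1)
    fitness = lambda mask: -aic(X, y, [j for j in range(p) if mask[j]])
    popn = [[rng.randint(0, 1) for _ in range(p)] for _ in range(pop)]
    for _ in range(gens):
        popn.sort(key=fitness, reverse=True)
        nxt = [row[:] for row in popn[:2]]        # elitism: keep the best two
        while len(nxt) < pop:
            a, b = rng.sample(popn[:10], 2)       # parents from the fittest
            cut = rng.randrange(1, p)
            child = a[:cut] + b[cut:]             # one-point crossover
            if rng.random() < 0.2:                # mutation: flip one bit
                child[rng.randrange(p)] ^= 1
            nxt.append(child)
        popn = nxt
    best = max(popn, key=fitness)
    return [j for j in range(p) if best[j]]

rng = random.Random(7)
n, p = 80, 6
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3 * row[0] - 2 * row[2] + rng.gauss(0, 0.5) for row in X]
selected = ga_select(X, y, p)   # should retain the informative columns 0 and 2
```

Swapping the fitness for ICOMP, or the OLS fit for a penalized logistic fit, recovers the flavor of the procedures described above.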

Book High Dimensional Data Analysis in Cancer Research

Download or read book High Dimensional Data Analysis in Cancer Research written by Xiaochun Li and published by Springer Science & Business Media. This book was released on 2008-12-19 with total page 164 pages. Available in PDF, EPUB and Kindle. Book excerpt: Multivariate analysis is a mainstay of statistical tools in the analysis of biomedical data. It concerns associating data matrices of n rows by p columns, with rows representing samples (or patients) and columns representing attributes of the samples, to response variables, e.g., patient outcomes. Classically, the sample size n is much larger than p, the number of variables. The properties of statistical models have mostly been discussed under the assumption of fixed p and infinite n. The advance of biological sciences and technologies has revolutionized the investigation of cancer, and biomedical data collection has become more automatic and more extensive. We are now in the era of p being a large fraction of n, or even much larger than n. Take proteomics as an example. Although proteomic techniques have been researched and developed for many decades to identify proteins or peptides uniquely associated with a given disease state, until recently this has been mostly a laborious process, carried out one protein at a time. The advent of high-throughput proteome-wide technologies such as liquid chromatography-tandem mass spectrometry makes it possible to generate proteomic signatures that facilitate rapid development of new strategies for proteomics-based detection of disease. This poses new challenges and calls for scalable solutions to the analysis of such high dimensional data. In this volume, we present systematic analytical approaches and strategies from both biostatistics and bioinformatics for the analysis of correlated and high-dimensional data.

Book Statistical Learning with Sparsity

Download or read book Statistical Learning with Sparsity written by Trevor Hastie and published by CRC Press. This book was released on 2015-05-07 with total page 354 pages. Available in PDF, EPUB and Kindle. Book excerpt: Discover new methods for dealing with high-dimensional data. A sparse statistical model has only a small number of nonzero parameters or weights; therefore, it is much easier to estimate and interpret than a dense model. Statistical Learning with Sparsity: The Lasso and Generalizations presents methods that exploit sparsity to help recover the underlying signal in a set of data.
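
The lasso and its elastic-net generalization are typically computed by coordinate descent with soft-thresholding, and the thresholding step is exactly what produces the sparsity described above. Below is a minimal pure-Python sketch under one common objective scaling, (1/2n)||y - Xb||^2 + lam*(alpha*||b||_1 + (1-alpha)/2*||b||_2^2); it is an illustration with invented synthetic data, not the book's reference implementation.

```python
import random

def soft_threshold(z, gamma):
    """Soft-thresholding operator S(z, gamma): shrink toward 0, clip to 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def elastic_net(X, y, lam=0.2, alpha=0.9, n_iter=100):
    """Cyclic coordinate descent for the elastic net (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # correlation of feature j with the partial residual
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k]
                                      for k in range(p) if k != j))
                for i in range(n)
            ) / n
            zj = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam * alpha) / (zj + lam * (1 - alpha))
    return beta

rng = random.Random(0)
n = 200
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
y = [2.0 * row[0] + rng.gauss(0, 0.1) for row in X]   # only feature 0 matters
beta = elastic_net(X, y)
```

The irrelevant coefficients are soft-thresholded to (near) zero, which is what makes the fitted model easy to interpret, the book's central theme.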

Book A Non-iterative Method for Fitting the Single Index Quantile Regression Model with Uncensored and Censored Data

Download or read book A Non-iterative Method for Fitting the Single Index Quantile Regression Model with Uncensored and Censored Data written by Eliana Christou and published by . This book was released on 2016. Available in PDF, EPUB and Kindle. Book excerpt: Quantile regression (QR) is becoming increasingly popular due to its relevance in many scientific investigations. Linear and nonlinear QR models have been studied extensively, while recent research focuses on the single index quantile regression (SIQR) model. Compared to the single index mean regression (SIMR) problem, the fitting and the asymptotic theory of the SIQR model are more complicated due to the lack of closed form expressions for estimators of conditional quantiles. Consequently, existing methods are necessarily iterative. We propose a non-iterative estimation algorithm and derive the asymptotic distribution of the proposed estimator under heteroscedasticity. For identifiability, we use a parametrization that sets the first coefficient to 1 instead of the typical condition that restricts the norm of the parametric component. This distinction is more than cosmetic, as it affects in a critical way the correspondence between the estimator derived and the asymptotic theory. The ubiquity of high dimensional data has led to a number of variable selection methods for linear/nonlinear QR models and, recently, for the SIQR model. We propose a new algorithm for simultaneous variable selection and parameter estimation that is also applicable to heteroscedastic data. The proposed algorithm, which is non-iterative, consists of two steps. Step 1 performs an initial variable selection. Step 2 uses the results of Step 1 to obtain better estimates of the conditional quantiles and, using them, performs simultaneous variable selection and estimation of the parametric component of the SIQR model.
It is shown that the initial variable selection method of Step 1 consistently estimates the relevant variables, and that the estimated parametric component derived in Step 2 satisfies the oracle property. Furthermore, QR is particularly relevant for the analysis of censored survival data as an alternative to proportional hazards and the accelerated failure time models. Such data occur frequently in biostatistics, environmental sciences, social sciences and econometrics. There is a large body of work for linear/nonlinear QR models for censored data, but it is only recently that the SIQR model has received some attention. However, the only existing method for fitting the SIQR model uses an iterative algorithm and no asymptotic theory for the resulting estimator of the Euclidean parameter is given. We propose a new non-iterative estimation algorithm, and derive the asymptotic distribution of the proposed estimator under heteroscedasticity.
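
The estimation target throughout this work is the conditional quantile, defined through the check (pinball) loss rather than squared error. A minimal illustration, with the fitted quantile found by scanning candidate values (helper names invented; the actual SIQR fitting is far more involved):

```python
def check_loss(u, tau):
    """Pinball loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def fit_quantile(y, tau):
    """The empirical tau-quantile minimizes total check loss;
    here the sample itself serves as the candidate set."""
    return min(y, key=lambda c: sum(check_loss(yi - c, tau) for yi in y))
```

For tau = 0.5 this recovers the median; varying tau traces out the whole conditional distribution, which is what makes QR attractive for heteroscedastic data.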

Book Variable Selection with Penalized Gaussian Process Regression Models

Download or read book Variable Selection with Penalized Gaussian Process Regression Models written by Gang Yi and published by . This book was released on 2010 with total page 151 pages. Available in PDF, EPUB and Kindle. Book excerpt:

Book Comprehensive Chemometrics

Download or read book Comprehensive Chemometrics written by Steven Brown and published by Elsevier. This book was released on 2020-05-26 with total page 2948 pages. Available in PDF, EPUB and Kindle. Book excerpt: Comprehensive Chemometrics, Second Edition, Four Volume Set features expanded and updated coverage, along with new content that covers advances in the field since the previous edition published in 2009. Subjects of note include updates in the fields of multidimensional and megavariate data analysis, omics data analysis, big chemical and biochemical data analysis, data fusion and sparse methods. The book follows a similar structure to the previous edition, using the same section titles to frame articles. Many chapters from the previous edition are updated, but there are also many new chapters on the latest developments. Presents integrated reviews of each chemical and biological method, examining their merits and limitations through practical examples and extensive visuals. Bridges a gap in knowledge, covering developments in the field since the first edition published in 2009. Meticulously organized, with articles split into 4 sections and 12 sub-sections on key topics to allow students, researchers and professionals to find relevant information quickly and easily. Written by academics and practitioners from various fields and regions to ensure that the knowledge within is easily understood and applicable to a large audience.

Book Composite Quantile Regression for the Single Index Model

Download or read book Composite Quantile Regression for the Single Index Model written by Yan Fan and published by . This book was released on 2017 with total page 43 pages. Available in PDF, EPUB and Kindle. Book excerpt: Quantile regression is the focus of many estimation techniques and is an important tool in data analysis. When it comes to nonparametric specification of the conditional quantile (or, more generally, tail) curve, one faces, as in mean regression, a dimensionality problem. We propose a projection-based single index model specification. For very high dimensional regressors X one faces yet another dimensionality problem and needs to balance precision against dimension. Such a balance may be achieved by combining semiparametric ideas with variable selection techniques.

Book Variable Selection Via Penalized Likelihood

Download or read book Variable Selection Via Penalized Likelihood written by and published by . This book was released on 2014 with total page 121 pages. Available in PDF, EPUB and Kindle. Book excerpt: Variable selection via penalized likelihood plays an important role in high dimensional statistical modeling and has attracted great attention in the recent literature. This thesis is devoted to the study of the variable selection problem. It consists of three major parts, all of which fall within the framework of the penalized least squares regression setting. In the first part of this thesis, we propose a family of nonconvex penalties named the K-Smallest Items (KSI) penalty for variable selection, which is able to improve the performance of variable selection and reduce estimation bias on the estimates of the important coefficients. We fully investigate the theoretical properties of the KSI method and show that it possesses the weak oracle property and the oracle property in the high-dimensional setting where the number of coefficients is allowed to be much larger than the sample size. To demonstrate its numerical performance, we applied the KSI method to several simulation examples as well as the well-known Boston housing dataset. We also extend the idea of the KSI method to handle the group variable selection problem. In the second part of this thesis, we propose another nonconvex penalty named the Self-adaptive penalty (SAP) for variable selection. It is distinguished from other existing methods in that the penalization of each individual coefficient directly takes into account the influence of the other estimated coefficients. We also thoroughly study the theoretical properties of the SAP method and show that it possesses the weak oracle property under desirable conditions. The proposed method is applied to the glioblastoma cancer data obtained from The Cancer Genome Atlas. In many scientific and engineering applications, covariates are naturally grouped.
When the group structures are available among covariates, people are usually interested in identifying both important groups and important variables within the selected groups. In statistics, this is the group variable selection problem. In the third part of this thesis, we propose a novel Log-Exp-Sum (LES) penalty for group variable selection. The LES penalty is strictly convex. It can identify important groups as well as select important variables within those groups. We develop an efficient group-level coordinate descent algorithm to fit the model. We also derive non-asymptotic error bounds and asymptotic group selection consistency for our method in the high-dimensional setting. Numerical results demonstrate the good performance of our method in both variable selection and prediction. We applied the proposed method to an American Cancer Society breast cancer survivor dataset. The findings are clinically meaningful and may help design intervention programs to improve the quality of life of breast cancer survivors.
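
The LES penalty is the thesis's own construction and is not reproduced here; for intuition about group-level selection, the classical group lasso (a related convex group penalty) drops whole groups through a blockwise soft-threshold inside coordinate descent. A sketch of that update, with names invented:

```python
import math

def group_soft_threshold(z, lam):
    """Group-lasso proximal step: shrink a whole coefficient block toward
    zero, and zero it out entirely when its Euclidean norm is below lam."""
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= lam:
        return [0.0] * len(z)
    scale = 1 - lam / norm
    return [scale * v for v in z]
```

The group lasso handles only the group level (a surviving group keeps all its coefficients); the LES penalty described above additionally selects individual variables within surviving groups.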

Book Chemometrics in Chromatography

Download or read book Chemometrics in Chromatography written by Łukasz Komsta and published by CRC Press. This book was released on 2018-02-02 with total page 506 pages. Available in PDF, EPUB and Kindle. Book excerpt: Chemometrics uses advanced mathematical and statistical algorithms to extract maximum chemical information from chemical data and to obtain knowledge of chemical systems. Chemometrics significantly extends the possibilities of chromatography, and with the technological advances of the personal computer and the continuous development of open-source software, many laboratories are interested in incorporating chemometrics into their chromatographic methods. This book is an up-to-date reference that presents the most important information about each area of chemometrics used in chromatography, demonstrating its effective use when applied to a chromatographic separation.

Book Variable Selection and Function Estimation Using Penalized Methods

Download or read book Variable Selection and Function Estimation Using Penalized Methods written by Ganggang Xu and published by . This book was released on 2012. Available in PDF, EPUB and Kindle. Book excerpt: Penalized methods are becoming more and more popular in statistical research. This dissertation covers two major aspects of the application of penalized methods: variable selection and nonparametric function estimation. The following two paragraphs give brief introductions to each of the two topics. Infinite variance autoregressive models are important for modeling heavy-tailed time series. We use a penalty method to conduct model selection for autoregressive models with innovations in the domain of attraction of a stable law indexed by alpha in (0, 2). We show that by combining the least absolute deviation loss function and the adaptive lasso penalty, we can consistently identify the true model. At the same time, the resulting coefficient estimator converges at a rate of n^(1/alpha). The proposed approach gives a unified variable selection procedure for both finite and infinite variance autoregressive models. While automatic smoothing parameter selection for nonparametric function estimation has been extensively researched for independent data, it is much less so for clustered and longitudinal data. Although leave-subject-out cross-validation (CV) has been widely used, its theoretical properties are unknown and its minimization is computationally expensive, especially when there are multiple smoothing parameters. By focusing on penalized modeling methods, we show that leave-subject-out CV is optimal in that its minimization is asymptotically equivalent to the minimization of the true loss function. We develop an efficient Newton-type algorithm to compute the smoothing parameters that minimize the CV criterion.
Furthermore, we derive a simplification of the leave-subject-out CV, which leads to a more efficient algorithm for selecting the smoothing parameters. We show that the simplified CV criterion is asymptotically equivalent to the unsimplified one and thus enjoys the same optimality property. This CV criterion also provides a completely data-driven approach to selecting the working covariance structure using generalized estimating equations in longitudinal data analysis. Our results are applicable to additive models, linear varying-coefficient models, and nonlinear models with data from exponential families.
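
Leave-subject-out CV differs from ordinary leave-one-out in that all observations from one subject are held out together, respecting within-subject correlation. A minimal sketch with an intentionally trivial model, the grand mean (function names invented; the dissertation pairs this criterion with penalized spline fits):

```python
def leave_subject_out_cv(data, fit, predict):
    """data: list of (subject_id, x, y). Hold out every row of one
    subject at a time, refit on the rest, and average squared error."""
    subjects = sorted({s for s, _, _ in data})
    sse, n = 0.0, 0
    for s in subjects:
        train = [(x, y) for sid, x, y in data if sid != s]
        test = [(x, y) for sid, x, y in data if sid == s]
        model = fit(train)
        for x, y in test:
            sse += (y - predict(model, x)) ** 2
            n += 1
    return sse / n

def fit_mean(train):
    """Toy 'model': the grand mean of the training responses."""
    ys = [y for _, y in train]
    return sum(ys) / len(ys)

def predict_mean(model, x):
    return model

data = [("A", 0, 0.0), ("A", 0, 0.0), ("B", 0, 2.0), ("B", 0, 2.0)]
score = leave_subject_out_cv(data, fit_mean, predict_mean)
```

Minimizing this score over smoothing parameters, rather than over the toy mean model, is the use case discussed above.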

Book Efficient Nonparametric and Semiparametric Regression Methods with Application in Case Control Studies

Download or read book Efficient Nonparametric and Semiparametric Regression Methods with Application in Case Control Studies written by Shahina Rahman and published by . This book was released on 2015. Available in PDF, EPUB and Kindle. Book excerpt: Regression analysis is one of the most important tools of statistics and is widely used in other scientific fields for projection and modeling of associations between variables. With modern computing techniques and high-performance devices, regression analysis in multiple dimensions has become an important issue. Our task is to address modeling with no assumption on the mean and variance structure and, further, with no assumption on the error distribution. In other words, we focus on developing robust semiparametric and nonparametric regression methods. In modern genetic epidemiological association studies, it is often important to investigate the relationships among the potential covariates related to disease in case-control data, a study known as "secondary analysis". First we model the association between the potential covariates nonparametrically in the univariate setting. Then we model the association in the multivariate setting by assuming a convenient and popular multivariate semiparametric model known as the single-index model. The secondary analysis of case-control studies is particularly challenging for multiple reasons: (a) the case-control sample is not a random sample, (b) the logistic intercept is practically not identifiable, and (c) misspecification of the error distribution leads to inconsistent results. For a rare disease, controls (individuals free of disease) are typically used for valid estimation. However, numerous publications have utilized the entire case-control sample (including the diseased individuals) to increase efficiency.
Previous work in this context has either specified a fully parametric distribution for the regression errors, specified a homoscedastic distribution for the regression errors, or assumed parametric forms for the regression mean. In the first chapter we focus on predicting a univariate covariate Y from another univariate covariate X without any parametric form for the mean function and without any distributional assumption on the error, hence addressing potential heteroscedasticity, a problem which has not been studied before. We develop a tilted kernel based estimator, which is a first attempt to model the mean function nonparametrically in secondary analysis. In the following chapters, we focus on i.i.d. samples to model both the mean and variance functions for predicting Y from multiple covariates X without assuming any form for the regression mean. In particular, we model Y by a single-index model m(X^T θ), where θ is the single-index vector and m is unspecified. We also model the variance function by another flexible single-index model. We develop a practical and readily applicable Bayesian methodology based on penalized splines and Markov chain Monte Carlo (MCMC), both in the i.i.d. setting and in the case-control setting. For efficient estimation, we model the error distribution by a Dirichlet process mixture of normals (DPMM). In numerical examples, we illustrate the finite sample performance of the posterior estimates for both the i.i.d. and the case-control setting. For the single-index setting, only one existing work, based on a local linear kernel method, addresses modeling of the variance function in the i.i.d. case. We found that our DPMM-based method vastly outperforms it in terms of mean square efficiency and computational stability.
We develop single-index modeling in secondary analysis to introduce flexible mean and variance function modeling in case-control studies, a problem which has not been studied before. We show that our method is almost twice as efficient as using only the controls, which is the typical practice. We use real data examples from the NIH-AARP study on breast cancer, from a colon cancer study on red meat consumption, and from the National Morbidity Air Pollution Study to illustrate the computational efficiency and stability of our methods. The electronic version of this dissertation is accessible from http://hdl.handle.net/1969.1/155719

Book Regularized Regression in Generalized Linear Measurement Error Models with Instrumental Variables: Variable Selection and Parameter Estimation

Download or read book Regularized Regression in Generalized Linear Measurement Error Models with Instrumental Variables: Variable Selection and Parameter Estimation written by Lin Xue and published by . This book was released on 2020 with total page 0 pages. Available in PDF, EPUB and Kindle. Book excerpt: Regularization is a commonly used technique in high dimensional data analysis. With a properly chosen tuning parameter for certain penalty functions, the resulting estimator is consistent in both variable selection and parameter estimation. Most regularization methods assume that the data can be observed and precisely measured. However, it is well known that measurement error (ME) is ubiquitous in real-world datasets. In many situations some or all covariates cannot be observed directly or are measured with errors. For example, in cardiovascular disease related studies, the goal is to identify important risk factors such as blood pressure, cholesterol level and body mass index, which cannot be measured precisely. Instead, the corresponding proxies are employed for analysis. If the ME is ignored in regularized regression, the resulting naive estimator can have high selection and estimation bias. Accordingly, important covariates are falsely dropped from the model and redundant covariates are incorrectly retained. We illustrate how ME affects variable selection and parameter estimation through theoretical analysis and several numerical examples. To correct for the ME effects, we propose an instrumental variable assisted regularization method for linear and generalized linear models. We show that the proposed estimator has the oracle property: it is consistent in both variable selection and parameter estimation. The asymptotic distribution of the estimator is derived.
In addition, we show that the implementation of the proposed method is equivalent to the plug-in approach under linear models, and that the asymptotic variance-covariance matrix has a compact form. Extensive simulation studies in linear, logistic and Poisson log-linear regression show that the proposed estimator outperforms the naive estimator in both linear and generalized linear models. Although the focus of this study is classical ME, we also discuss variable selection and estimation in the setting of Berkson ME. In particular, our finite sample simulation studies show that, in contrast to estimation in linear regression, Berkson ME may cause bias in variable selection and estimation. Finally, the proposed method is applied to real datasets from a diabetes study and the Framingham Heart Study.
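
The attenuation bias the naive estimator suffers is easy to demonstrate: regressing on an error-contaminated proxy shrinks the slope toward zero by the reliability ratio var(x)/(var(x) + var(u)). A small invented simulation of classical additive ME (not taken from the dissertation):

```python
import random

def slope(x, y):
    """Simple least-squares slope of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(42)
n, true_beta = 5000, 2.0
x = [rng.gauss(0, 1) for _ in range(n)]                 # true covariate
y = [true_beta * xi + rng.gauss(0, 0.5) for xi in x]    # response
w = [xi + rng.gauss(0, 1) for xi in x]                  # observed proxy, var(u) = 1

oracle_slope = slope(x, y)   # close to the true slope of 2
naive_slope = slope(w, y)    # attenuated to roughly half, since var(x) = var(u)
```

An instrumental variable correlated with x but independent of the ME is what restores a consistent slope in the approach described above.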

Book Penalized Method Based on Representatives and Nonparametric Analysis of Gap Data

Download or read book Penalized Method Based on Representatives and Nonparametric Analysis of Gap Data written by Soyoun Park and published by . This book was released on 2010. Available in PDF, EPUB and Kindle. Book excerpt: When there are a large number of predictors and few observations, building a regression model to explain the behavior of a response variable such as a patient's medical condition is very challenging. This is a "p >> n" variable selection problem encountered often in modern applied statistics and data mining. Chapter one of this thesis proposes a rigorous procedure which groups predictors into clusters of highly correlated variables, selects a representative from each cluster, and uses a subset of the representatives for regression modeling. The proposed Penalized method based on Representatives (PR) extends the Lasso to p >> n data with highly correlated variables, building a sparse model that is practically interpretable while maintaining prediction quality. Moreover, we provide the PR-Sequential Grouped Regression (PR-SGR) to make computation of the PR procedure efficient. Simulation studies show the proposed method outperforms existing methods such as the Lasso/Lars. A real-life example from a mental health diagnosis illustrates the applicability of the PR-SGR. In the second part of the thesis, we study the analysis of time-to-event data, called gap data, when missing time intervals (gaps) may occur prior to the first observed event time. If a gap occurs prior to the first observed event, then the first observed event may or may not be the first true event. This incomplete knowledge makes gap data different from the well-studied regular interval-censored data. We propose a Non-Parametric Estimate for the Gap data (NPEG) to estimate the survival function for the first true event time, derive its analytic properties and demonstrate its performance in simulations.
We also extend the Imputed Empirical Estimating method (IEE), an existing nonparametric method for gap data with up to one gap, to handle gap data with multiple gaps.
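
Estimators like NPEG build on the Kaplan-Meier product-limit estimator, which handles right-censored observations by discounting the risk set rather than discarding them. A minimal implementation (the NPEG extension for gaps is not shown):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimator.
    times: observed times; events: 1 = event observed, 0 = censored.
    Returns [(t, S(t))] at each distinct event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, out = 1.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = sum(1 for tt, e in data if tt == t and e == 1)  # events at t
        c = sum(1 for tt, e in data if tt == t)             # all leaving at t
        if d > 0:
            surv *= 1 - d / n_at_risk
            out.append((t, surv))
        n_at_risk -= c
        i += c
    return out
```

For example, with times [1, 2, 3] and event indicators [1, 0, 1], the censored subject at time 2 leaves the risk set without triggering a survival drop.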

Book Variable Selection for High-dimensional Data with Error Control

Download or read book Variable Selection for High-dimensional Data with Error Control written by Han Fu (Ph. D. in biostatistics) and published by . This book was released on 2022 with total page 0 pages. Available in PDF, EPUB and Kindle. Book excerpt: Many high-throughput genomic applications involve a large set of covariates, and it is crucial to discover which variables are truly associated with the response. It is often desirable for researchers to select variables that are indeed true and reproducible in follow-up studies. Effectively controlling the false discovery rate (FDR) increases the reproducibility of the discoveries and has been a major challenge in variable selection research, especially for high-dimensional data. Existing error control approaches include augmentation approaches, which utilize artificial variables as benchmarks for decision making, such as model-X knockoffs. We introduce another augmentation-based selection framework extended from a Bayesian screening approach called reference distribution variable selection. Ordinal responses, which were not previously considered in this area, were used to compare different variable selection approaches. We constructed various importance measures that fit into the selection frameworks, using either L1 penalized regression or machine learning techniques, and compared these measures in terms of FDR and power using simulated data. Moreover, we applied these selection methods to high-throughput methylation data for identifying features associated with the progression from normal liver tissue to hepatocellular carcinoma, to further compare and contrast their performances. Having established the effectiveness of FDR control for model-X knockoffs, we turned our attention to another important data type: survival data with long-term survivors. Medical breakthroughs in recent years have led to cures for many diseases, resulting in increased observations of long-term survivors.
The mixture cure model (MCM) is a type of survival model that is often used when a cured fraction exists. Unfortunately, few variable selection methods currently exist for MCMs when there are more predictors than samples. To fill the gap, we developed penalized MCMs for high-dimensional datasets which allow for identification of prognostic factors associated with cure status and/or survival. Both parametric models and semi-parametric proportional hazards models were considered for modeling the survival component. For penalized parametric MCMs, we demonstrated how the estimation proceeds using two different iterative algorithms, the generalized monotone incremental forward stagewise (GMIFS) and expectation-maximization (E-M) algorithms. For semi-parametric MCMs, where multiple types of penalty functions were considered, the coordinate descent algorithm was combined with E-M for optimization. The model-X knockoffs method was combined with these algorithms to allow for FDR control in variable selection. Through extensive simulation studies, our penalized MCMs have been shown to outperform alternative methods on multiple metrics and to achieve high statistical power with the FDR controlled. In two acute myeloid leukemia (AML) applications with gene expression data, our proposed approaches identified important genes associated with potential cure or time-to-relapse, which may help inform treatment decisions for AML patients.
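
Model-X knockoffs require constructing valid knockoff copies of the design matrix, which is beyond a short sketch; as a simpler illustration of the FDR control goal shared by these methods, here is the classical Benjamini-Hochberg step-up procedure applied to a vector of p-values (numbers invented):

```python
def benjamini_hochberg(pvals, q=0.1):
    """Benjamini-Hochberg step-up: return the indices of hypotheses
    rejected at FDR level q. Find the largest rank k with
    p_(k) <= q * k / m and reject the k smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])

rejected = benjamini_hochberg([0.01, 0.02, 0.03, 0.5, 0.6], q=0.1)
```

Knockoff-based selection achieves the same FDR guarantee without p-values, by comparing each variable's importance statistic to that of its knockoff copy.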