
Mateu Sbert

Title: Generalized entropies from f-divergences

Authors: Mateu Sbert, Min Chen, Jordi Poch, Miquel Feixas and Shuning Chen

Keywords: f-divergences, Shannon entropy, generalized entropies, majorization

Abstract: In this talk we present our recent results on generalized entropies and their properties. Measuring the homogeneity or diversity of a distribution is essential across various disciplines, from evaluating the code produced by a cryptographic algorithm to assessing biodiversity as a measure of ecosystem health. Different fields require diverse homogeneity measures to capture distinct evaluation perspectives, similar to how different types of means can be calculated for the same data sequence. Shannon entropy, introduced to measure the information content of a distribution, and one-parameter entropies such as the Tsallis and Rényi entropies have frequently been used as homogeneity measures. On the other hand, the majorization pre-order can be considered a ground truth for comparing the homogeneity of distributions. Thus, a homogeneity measure that allows comparing any two distributions should preserve the majorization order; this is precisely the definition of Schur-convexity, or Schur-concavity if the order is reversed. f-divergences, which depend on a convex function, measure the difference between two distributions. Based on different approaches in the literature, we derive two families of generalized entropies, either directly from an f-divergence or by using its defining convex function. These new entropy families extend Shannon entropy, exhibit Schur-concavity, and satisfy the important grouping, or aggregation, property. The grouping property implies that entropy decreases whenever distribution indices are aggregated, making it a useful tool for simplifying distributions and achieving various levels of detail (LOD).
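As a hedged illustration of the type of construction involved (the talk derives two families of entropies whose exact definitions are not reproduced in this abstract), recall the f-divergence of a distribution P from Q for a convex f with f(1) = 0, and how Shannon entropy arises from the Kullback-Leibler divergence (f(t) = t log t) to the uniform distribution \(u = (1/n, \dots, 1/n)\):

\[
D_f(P \,\|\, Q) = \sum_{i=1}^{n} q_i \, f\!\left(\frac{p_i}{q_i}\right),
\qquad
D_{\mathrm{KL}}(P \,\|\, u) = \log n - H(P),
\qquad
H(P) = -\sum_{i=1}^{n} p_i \log p_i .
\]

A generalized entropy of this flavour can then be obtained, up to constants, by replacing the KL divergence with another f-divergence to the uniform distribution; this is only a sketch of one possible route, not necessarily either of the two families presented in the talk.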

Laura Vicente-Gonzalez

Title: Polychoric Dual STATIS: A Novel Approach for Categorical Data

Authors: Laura Vicente-Gonzalez, Elisa Frutos-Bernal and Jose Luis Vicente-Villardon

Keywords: STATIS, Dual-STATIS, Categorical data, Polychoric

Abstract: STATIS-ACT (des Plantes, 1976; Lavit et al., 1994) is a statistical technique used to extract common structures from multiple data tables. While the standard version of STATIS is suitable for continuous data and binary data with shared individuals (DISTATIS), a dual version is needed for analyzing categorical data with shared variables. To address this gap, we propose a novel approach called Polychoric Dual-STATIS. This method leverages polychoric correlations, which measure the association between ordinal categorical variables, to construct a consensus correlation matrix. By applying the dual STATIS framework to this matrix, we can uncover common patterns and relationships across multiple datasets. We illustrate the utility of Polychoric Dual-STATIS using a real-world dataset containing responses to the Rosenberg Self-Esteem Scale. This dataset provides a valuable opportunity to explore the application of our method to a practical problem. Polychoric Dual-STATIS offers a powerful and flexible tool for analyzing categorical data. By extending the STATIS framework, we have developed a method that can effectively extract common structures from multiple data tables, enhancing our understanding of complex relationships within and across datasets.
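As a rough illustration of the consensus (interstructure) step underlying Dual-STATIS, the following numpy sketch assumes the polychoric correlation matrices of the K tables have already been computed with some external routine; it weights the tables by the leading eigenvector of their RV-coefficient matrix and builds a consensus correlation matrix. This is a generic STATIS sketch under those assumptions, not the authors' implementation.

import numpy as np

def statis_consensus(corr_mats):
    """Consensus of K correlation matrices via the STATIS interstructure step.
    corr_mats: list of K symmetric (J x J) matrices, here assumed to be
    polychoric correlation matrices computed beforehand."""
    W = [np.asarray(m, dtype=float) for m in corr_mats]
    K = len(W)
    # RV coefficients between tables: normalized trace inner products
    C = np.empty((K, K))
    for k in range(K):
        for l in range(K):
            C[k, l] = np.trace(W[k] @ W[l]) / np.sqrt(
                np.trace(W[k] @ W[k]) * np.trace(W[l] @ W[l]))
    # table weights: leading eigenvector of the RV matrix, rescaled to sum to 1
    vals, vecs = np.linalg.eigh(C)
    alpha = np.abs(vecs[:, -1])
    alpha /= alpha.sum()
    # consensus (compromise) matrix, to be analyzed by its eigendecomposition
    consensus = sum(a * w for a, w in zip(alpha, W))
    return alpha, consensus

# toy usage with random positive-definite "correlation" matrices
gen = np.random.default_rng(0)
mats = []
for _ in range(4):
    A = gen.normal(size=(6, 6))
    S = A @ A.T
    d = np.sqrt(np.diag(S))
    mats.append(S / np.outer(d, d))
alpha, R = statis_consensus(mats)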

Bibliography:

des Plantes, H. L. (1976). Structuration des tableaux à trois indices de la statistique: théorie et application d’une méthode d’analyse conjointe. Université des sciences et techniques du Languedoc.

Lavit, C., Escoufier, Y., Sabatier, R., & Traissac, P. (1994). The ACT (STATIS method). Computational Statistics & Data Analysis, 18(1), 97–119. https://doi.org/10.1016/0167-9473(94)90134-1

José Luis Vicente Villardón

Title: Logistic Biplots for Mixed Types of Data

Authors: José Luis Vicente Villardón, Laura Vicente González and Elisa Frutos Bernal

Keywords: biplot, categorical data, multivariate analysis

Abstract: A biplot provides a joint graphical representation of individuals and variables within a data matrix. For binary, nominal, or ordinal variables, however, the traditional linear biplot is insufficient. Recently, new biplot approaches based on logistic response models have been developed for categorical data. In these methods, individual and variable coordinates are calculated to yield logistic responses along the biplot dimensions. Similar to how Classical Biplot Analysis (CBA) relates to linear regression, these methods—known as Logistic Biplots (LB)—parallel logistic regression. Just as linear biplots connect to Principal Component Analysis, Logistic Biplots are associated with Item Response Theory. This paper introduces a method to represent both continuous and categorical data (binary, nominal, and ordinal) within a biplot framework.
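For the binary case, the logistic biplot model can be sketched as follows (a sketch only; the extensions to nominal and ordinal variables are omitted here):

\[
\operatorname{logit}\, P(x_{ij} = 1) \;=\; b_{j0} + \sum_{k=1}^{K} a_{ik} b_{jk},
\]

so that individuals receive coordinates \(a_{ik}\) and variables direction vectors \(b_{jk}\) on the K biplot dimensions, in the same way that the classical linear biplot writes the expected value of \(x_{ij}\) as an inner product of row and column markers.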

Nirian Martín

Title: Testing for Complete Independence in High Dimensions through Rao’s Score Tests

Authors: Nirian Martín

Abstract: Rao’s score tests are not typically associated with high-dimensional statistical models. In order to test for complete independence in high dimensions, Schott (2005) proposed a method that was not evidently related to any classical statistical testing procedure for \(n\) observations from a \(p\)-dimensional normal distribution (\(p \ge n\)). This work demonstrates that the construction of the Rao score test and the determination of its \((n, p)\)-asymptotic distribution represent an alternative way of obtaining the same test. Furthermore, the methodology is extended to accommodate data sets with elliptical distributions, with a particular focus on normal scale mixture distributions as outlined by Muirhead and Waternaux (1980). The performance of the newly proposed Rao’s score test is studied through a simulation study as well as a real data application across various scenarios of sample and dimension sizes, covering both the usual dimensional constraint \(p < n\) and the high-dimensional case \(p \ge n\).
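For reference, and as a sketch only (the abstract does not reproduce the specific statistic), the classical Rao score statistic for a null hypothesis that restricts the parameter \(\theta\) has the form

\[
R_n \;=\; U(\tilde{\theta})^{\top} \, I(\tilde{\theta})^{-1} \, U(\tilde{\theta}),
\]

where \(U\) is the score vector, \(I\) the Fisher information and \(\tilde{\theta}\) the maximum likelihood estimator under the null hypothesis; the contribution of this work is to carry out this construction for complete independence and to derive its \((n, p)\)-asymptotic distribution instead of the classical fixed-\(p\) chi-squared limit.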

Fabio Scielzo-Ortiz

Title: New clustering algorithms for large mixed-type data

Authors: Fabio Scielzo-Ortiz and Aurea Grané

Keywords: clustering, fast k-medoids, generalized Gower, multivariate heterogeneous data, outliers, robust Mahalanobis

Abstract: In this work, new robust and efficient clustering algorithms for large datasets of mixed-type data are proposed and implemented in a new Python package called FastKmedoids. Their performance is analyzed through an extensive simulation study and compared to a wide range of existing clustering alternatives in terms of both predictive power and computational efficiency. MDS is used to visualize the clustering results.
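A minimal illustration of the kind of pipeline involved (not the FastKmedoids package itself, and using a plain Gower distance rather than the robust generalized Gower metrics of the paper) could look as follows, assuming scikit-learn-extra is available for the k-medoids step:

import numpy as np
from sklearn_extra.cluster import KMedoids  # from the scikit-learn-extra package

def gower_distance(X_num, X_cat):
    """Plain Gower distance: range-scaled L1 for numeric columns plus
    simple matching for categorical columns (no robustification)."""
    rng = X_num.max(axis=0) - X_num.min(axis=0)
    rng[rng == 0] = 1.0
    d_num = np.abs(X_num[:, None, :] - X_num[None, :, :]) / rng
    d_cat = (X_cat[:, None, :] != X_cat[None, :, :]).astype(float)
    return np.concatenate([d_num, d_cat], axis=2).mean(axis=2)

# toy mixed-type data: 3 numeric and 2 categorical variables
gen = np.random.default_rng(0)
X_num = gen.normal(size=(200, 3))
X_cat = gen.integers(0, 3, size=(200, 2))

D = gower_distance(X_num, X_cat)
labels = KMedoids(n_clusters=3, metric="precomputed", method="pam",
                  random_state=0).fit_predict(D)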

Eva Boj

Title: Robust distance-based generalized linear models: A new tool for classification

Authors: Eva Boj, Aurea Grané and Agustín Mayo-Íscar

Keywords: robust metrics, weighted data, minimum covariance determinant, dbglm, dbstats, R

Abstract: Distance-based generalized linear models are prediction tools that can be applied to any kind of data whenever a distance measure can be computed among units. In this work, robust ad hoc metrics are proposed for the predictors’ space of these models, adding flexibility to this tool. Their performance is evaluated by means of a simulation study, and several applications on real data are provided. Computations are made using the dbstats package for R.
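The underlying idea can be sketched in a few lines: a distance matrix among units is converted into latent Euclidean coordinates (principal coordinates) that feed an ordinary generalized linear model. The following numpy/statsmodels sketch illustrates this with a hypothetical binary response; it is not the dbglm function of the dbstats package for R, and it ignores robust metrics, observation weights and the prediction of new units.

import numpy as np
import statsmodels.api as sm

def principal_coordinates(D, k):
    """Classical multidimensional scaling of a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J            # double-centred squared distances
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]       # keep the k largest eigenvalues
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# hypothetical data: distances among 80 units and a binary response
gen = np.random.default_rng(1)
Z = gen.normal(size=(80, 4))
D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
y = (Z[:, 0] + gen.normal(size=80) > 0).astype(int)

X = sm.add_constant(principal_coordinates(D, k=3))
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.summary())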

Beatriz Sinova

Title: Empirical finite-sample performance of fuzzy S-estimators

Authors: Beatriz Sinova and Stefan Van Aelst

Keywords: Fuzzy-valued data, S-estimator, empirical study, robustness

Abstract: Fuzzy-valued data are very useful for coping with imprecise attributes. Several statistical techniques can be found in the literature to analyze fuzzy-valued data sets, but most of them are based on a non-robust measure of central tendency (the Aumann-type mean) and dispersion (the corresponding standard deviation). Even though these measures satisfy many useful properties, their lack of robustness means that conclusions become unreliable under data contamination due to outliers, errors or data changes. Among the more robust central tendency measures proposed in the literature, M-estimators of location have provided the best empirical performance in many scenarios.

The disadvantage of M-estimators of location in practice is that no scale estimator is naturally associated with them. S-estimators were introduced in the real-valued setting to solve this problem: while the behaviour of S-estimators of location is similar to that of M-estimators of location, there is in this case an associated S-estimator of scale. The main aim of this work is to adapt the notion of S-estimator to the fuzzy-valued framework.
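For context, in the real-valued setting an S-estimator of location and scale \((\hat{m}, \hat{s})\) can be written as the solution of

\[
\hat{s} \;=\; \min_{m}\, s(m)
\quad\text{subject to}\quad
\frac{1}{n}\sum_{i=1}^{n} \rho\!\left(\frac{x_i - m}{s(m)}\right) = \delta,
\]

with \(\rho\) a bounded loss function (e.g., Tukey’s biweight), \(\delta\) a consistency constant and \(\hat{m}\) the minimizing location. The fuzzy-valued adaptation is assumed here to replace the residual \(x_i - m\) by a suitable metric between fuzzy numbers; the exact formulation is given in the work itself.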

As a preliminary analysis of the robustness of this proposal, a simulation study has been conducted for trapezoidal fuzzy numbers. Under different scenarios, fuzzy S-estimators of location and scale have been compared to the Aumann-type mean and standard deviation and to fuzzy trimmed means with their associated trimmed scale. We observed that, even when the trimming proportion is large enough to cope with all the contamination, the proposed S-estimator of location behaves similarly to the fuzzy trimmed mean, whereas the S-estimator of scale outperforms the other scale measures, so fuzzy S-estimators of location and scale show the most stable overall performance.

Paula de la Lama Zubirán

Title: Exploring L1-Norm Penalization Approaches in Convex Clustering for Compositional Data

Authors: Paula de la Lama Zubirán and Jordi Saperas-Riera

Keywords: L1 Norm, Convex Clustering, CoDa

Abstract: Significant advancements have been made in studying space metrics in compositional data (CoDa) using Aitchison geometry, which provides a robust framework for analyzing data composed of interdependent parts of a whole. As inherently multivariate data, CoDa resides in the Simplex, a constrained space where each observation reflects relative proportions that sum to a constant. This study contributes to the field by performing a detailed comparative analysis of various norms, focusing on the L1-norm, and applying clustering algorithms specifically adapted to multivariate CoDa structures. In the objective function of our clustering algorithm, we introduce distinct approaches to the L1-norm in the penalization term: two variations of the L1-ilr norm (ilr based on principal components and ilr built from the default Sequential Binary Partition (SBP) of CoDaPack), the L1-clr norm, and the L1-CoDa norm. Using a dataset that includes the percentage composition of milk from 24 different mammals, based on five constituents, we evaluate the effectiveness of these metrics for grouping data within the Simplex. Our findings demonstrate that the induced CoDa L1-norm variants uphold fundamental compositional properties, such as scale invariance, facilitating meaningful data interpretation. The simulation results provide valuable insights into compositional analysis, underscoring the role of tailored norms in advancing multivariate CoDa classification and highlighting future methodological directions.
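As a hedged sketch of the setting (the exact objective used in the study is not reproduced in this abstract), a common formulation of convex clustering with an L1 penalty is

\[
\min_{u_1,\dots,u_n}\; \frac{1}{2}\sum_{i=1}^{n} \lVert x_i - u_i \rVert_2^{2}
\;+\; \lambda \sum_{i<j} w_{ij}\, \lVert u_i - u_j \rVert_{1},
\]

where each observation \(x_i\) is assigned a centroid \(u_i\) and the penalty fuses centroids together as \(\lambda\) grows; in the compositional setting described above, the \(L_1\) norm in the penalty term is replaced by the L1-ilr, L1-clr and L1-CoDa variants.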

Michele Gallo

Title: Tensor Data Analysis

Authors: Michele Gallo

Abstract: Tensors provide a powerful mathematical framework for managing and analyzing large datasets, making tensor data analysis a rapidly growing field in areas such as machine learning, signal processing, computer vision, graph analysis, and data mining. Starting from fundamental concepts in tensor algebra, we explore the key models used to decompose multidimensional arrays, also called hyper-matrices. Additionally, we examine the principal algorithms for estimating the parameters of these models. Both simulated data and real-world case studies are utilised to demonstrate the potential and applicability of this methodology across various domains.
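Two of the standard decomposition models for a three-way array \(\mathcal{X}\), written here as a minimal sketch, are the CP (canonical polyadic) and Tucker models,

\[
\mathcal{X} \;\approx\; \sum_{r=1}^{R} a_r \circ b_r \circ c_r,
\qquad
\mathcal{X} \;\approx\; \mathcal{G} \times_1 A \times_2 B \times_3 C,
\]

where \(\circ\) denotes the outer product, \(\mathcal{G}\) is a core tensor and \(\times_k\) the mode-\(k\) product; the talk examines these and related models together with the principal algorithms used to estimate their parameters.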

José García-García

Title: On the consistency of the bootstrap analogue estimator of Cronbach’s α coefficient for interval-valued data in questionnaires

Authors: José García-García and M. Asunción Lubiano

Keywords: Cronbach’s α coefficient, Interval-valued rating scale, Random intervals, Bootstrap, Consistency

Abstract: In recent times, the so-called interval-valued rating scales have become an efficient alternative to conventional single-point scales, such as Likert-type and visual analogue ones, for assessing people’s attitudinal, behavioral, and health-related traits in questionnaires. By allowing respondents to select a whole range of values that adequately reflects their valuations within a pre-specified bounded interval, these psychometric instruments make it possible to simultaneously capture the individual differences in the responses and the inherent imprecision attached to human ratings, unlike the above-mentioned scales. However, the employment of these novel scales entails the use of new statistical methods to analyze the collected interval-valued data. In particular, for measuring the internal consistency reliability of interval-valued scale-based items, that is, for quantifying the extent to which the considered items and rating scales produce responses that are congruous with each other, the definition of the well-known Cronbach α coefficient has been extended within the probabilistic and statistical framework. This extension follows a distance-based approach that uses the concept of random interval as a model for the mechanism that randomly generates such data. Consequently, general ideas for random elements taking values in metric spaces can be applied, which substantially simplifies the analysis and interpretation of a type of data that can be more complex but also richer and more informative.
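For reference, the classical Cronbach’s α for \(k\) items with variances \(\sigma_j^2\) and total-score variance \(\sigma_T^2\) is

\[
\alpha \;=\; \frac{k}{k-1}\left( 1 - \frac{\sum_{j=1}^{k}\sigma_j^{2}}{\sigma_T^{2}} \right),
\]

and the interval-valued extension considered here is assumed to replace these variances by distance-based (Fréchet-type) counterparts for random intervals, in line with the distance-based approach mentioned above; the precise definition is given in the work itself.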

Due to the lack of realistic and broad models for the distribution of random intervals, the distribution of the sample estimator of the extended Cronbach α coefficient cannot be derived exactly, as it can in the numerical case. Nevertheless, its asymptotic normality has recently been demonstrated on the basis of some classic theorems from real-valued multivariate data analysis. Unfortunately, this asymptotic distribution, as well as the inferential methods that could be developed from it, requires large sample sizes to provide accurate results. As has been shown in previous papers from the SMIRE+CoDiRE Research Group of the University of Oviedo (Spain) in connection with other relevant parameters of the (induced) distribution of random intervals (or, more generally, of fuzzy random variables), the bootstrap can provide very appropriate approximations for smaller sample sizes. For this reason, in this work the bootstrap analogue estimator of the extended Cronbach α coefficient is formalized and its asymptotic validity, or consistency, is analyzed, this consistency being essential for ensuring the correctness of the inferential techniques that could be derived from it.

Pilar González-Barquero

Title: Regularized Cox regression in high-dimensional settings

Authors: Pilar González-Barquero, Rosa E. Lillo and Álvaro Méndez-Civieta

Keywords: Survival analysis, Cox regression, penalization, lasso, adaptive lasso

Abstract: In high-dimensional contexts, standard Cox regression models are infeasible since there is an infinite number of possible solutions for the regression coefficients. Regularization techniques are necessary to address this issue, enhancing model interpretability and predictive accuracy. One widely used method is the Least Absolute Shrinkage and Selection Operator (lasso), which applies an L1 penalty to reduce some coefficients to zero. An alternative to lasso, the adaptive lasso, was later proposed to correct its bias. This method increases the flexibility of the model by assigning different weights to each variable. Although there are different proposals for calculating these weights, most of them are infeasible in high-dimensional settings. This study proposes and evaluates several methods for determining adaptive lasso weights, including Principal Component Analysis (PCA), ridge regression, univariate Cox regression, and the Random Survival Forest algorithm, and introduces a procedure for selecting the optimal model or variable selection method for Cox regression.
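In symbols, the penalized estimators discussed here take the form (a sketch, with \(\ell_n\) the Cox log partial likelihood)

\[
\hat{\beta} \;=\; \arg\min_{\beta}\; -\ell_n(\beta) \;+\; \lambda \sum_{j=1}^{p} w_j\, \lvert \beta_j \rvert,
\]

where the lasso corresponds to \(w_j \equiv 1\) and the adaptive lasso to data-driven weights, typically \(w_j = 1/\lvert \tilde{\beta}_j \rvert^{\gamma}\) for some initial estimate \(\tilde{\beta}\); the proposals of this work differ precisely in how that initial estimate or importance measure is obtained (PCA, ridge regression, univariate Cox regression, Random Survival Forest).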

Additionally, these techniques are applied to clinical and genomic data. Factors influencing survival are identified using a high-dimensional dataset, containing both clinical and genetic information from patients with triple-negative breast cancer (TNBC), an aggressive subtype of breast cancer associated with low survival rates.

Rafael Jiménez-Llamas

Title: Interpretable and Fair Logistic Regression via Variational Inference

Authors: Rafael Jiménez-Llamas, Emilio Carrizosa Priego and Pepa Ramírez Cobo

Keywords: fairness, group sparsity, logistic regression, variational inference

Abstract: In this work, the usual Mean-Field Variational Inference approach to logistic regression is modified by minimizing the Kullback-Leibler divergence with an extra penalization term that depends on the unfairness of the prediction. To do so, we define an appropriate unfairness metric to penalize unfair predictions for new individuals which, in addition, is obtained in a private manner, thus enhancing applicability. On the other hand, we use a specific prior structure to induce group sparsity. As a result, the new prediction method allows the decision maker to manage a triple trade-off between accuracy, fairness and sparsity/interpretability.
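Schematically, and as a sketch only (the exact unfairness metric and prior structure are those defined in the work), the modified variational problem can be written as

\[
\hat{q} \;=\; \arg\min_{q \in \mathcal{Q}} \; \mathrm{KL}\!\left(q(\theta)\,\|\,p(\theta \mid \mathcal{D})\right) \;+\; \lambda\, U(q),
\]

where \(\mathcal{Q}\) is the mean-field family, \(U\) the unfairness penalty and \(\lambda\) the weight controlling the accuracy-fairness trade-off, with group sparsity induced through the prior on \(\theta\).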

Victor Elvira

Title: State-space models as graphs

Authors: Victor Elvira

Abstract: Modeling and inference in multivariate time series are central in statistics, signal processing, and machine learning. A fundamental question when analyzing multivariate sequences is the search for relationships between their entries (or the modeled hidden states), especially when the inherent structure is a directed (causal) graph. In such a context, graphical modeling combined with sparsity constraints limits the proliferation of parameters and enables a compact data representation that is easier to interpret in applications, e.g., in inferring causal relationships of physical processes in a Granger sense. In this talk, we present a novel perspective in which state-space models are interpreted as graphs. We then propose novel algorithms that exploit this perspective for the estimation of the linear matrix operator and the covariance matrix in the state equation of a linear-Gaussian state-space model. Finally, we discuss the extension of this perspective to the estimation of other model parameters in more complicated models.
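The model class in question is the linear-Gaussian state-space model, which for a hidden state \(x_t\) and observation \(y_t\) reads

\[
x_t = A\,x_{t-1} + q_t,\quad q_t \sim \mathcal{N}(0, Q),
\qquad
y_t = H\,x_t + r_t,\quad r_t \sim \mathcal{N}(0, R),
\]

so that the sparsity pattern of the transition matrix \(A\) can be read as a directed graph among the entries of the state, giving Granger-type causal interpretations; the algorithms presented estimate \(A\) and the state covariance \(Q\) under such graphical sparsity constraints.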

Juan José Egozcue

Title: Distances between detrital zircon U-Pb isotopic age distributions

Authors: Juan José Egozcue, Vera Pawlowsky-Glahn and Javier Fernández-Suárez

Keywords: Bayes-Hilbert spaces, Aitchison distance, weighted clr, multidimensional scaling

Abstract: In sedimentary provenance studies of detrital rocks or modern sediments, each sample provides hundreds to thousands of zircon crystals, of which a randomly picked selection is used for dating by U-Pb isotopes. The obtained ages, in million years (My), can be thought of as realizations of a random age whose distribution is of interest. Given a number of samples, in which commonly ca. 70-150 U-Pb ages per sample are obtained, the main problem is to determine dissimilarities between the age distributions of the rock samples in order to infer whether or not the compared samples derive from the same parent population, i.e. the same source area.

There are several ways of determining dissimilarities between age distributions, such as the Kolmogorov-Smirnov (KS), Wasserstein (W2) or L2 distances between densities. After the dissimilarity analysis, the results are used for a 2D visualization based on multidimensional scaling techniques. This kind of analysis provides useful geological insights and allows for interpretations that must be checked against other kinds of geological evidence. However, these techniques can be discussed from a purely statistical point of view. The main point is the sample space, both of the random age and of the random distribution of ages.

The U-Pb age of a zircon crystal is always positive and its scale can be considered relative. Then, deviations between ages can be measured by log-age differences and the random variable can be the log-age. Although this may seem unimportant, the mean and mode/s of the random variable change according to the adopted scale.

Furthermore, the log-age densities are random across the rock samples. Since probability densities are positive and are usually normalized to arbitrary constants (e.g. 1, 100), modelling them in L2 space is not consistent. The proposed alternative is to take as sample space the Bayes-Hilbert space of densities on a limited log-age interval. The advantage of this assumption is that it provides a complete Hilbert space structure where the group operation is Bayes updating. In addition, an inner product, norm and distance are available. The distance is called the Aitchison distance, as it generalizes the Aitchison distance for compositional data.

Bayes-Hilbert spaces admit changes of reference measure, which act as a weighting along the log-age axis. In this fashion, log-age intervals where the observed log-ages are scarce (or absent) can be down-weighted.

This choice of sample space for the log-age densities suggests the following workflow to analyse available data:
(A) Kernel density estimation for each sample over the largest interval where log-ages are observed.
(B) Select the sample mean of densities (geometric mean) as the density of the reference measure, which defines the weighting along the log-age axis.
(C) Resample the original data to account for uncertainty.
(D) Kernel density estimation of the observed and resampled data.
(E) Express the weighted centered log-ratio (clr) of each observed and resampled density.
(F) Compute Aitchison distances between densities as L2 distances of the weighted clr’s.
(G) Carry out a multidimensional scaling of the observed and resampled densities and plot the result (a schematic code sketch of steps (A), (B), (E), (F) and (G) is given below).
A real data example is used to illustrate the procedure.
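A rough numpy/scipy sketch of steps (A), (B), (E), (F) and (G) on a common log-age grid follows; the resampling steps (C)-(D) are omitted, the grid evaluation and normalizations are simplifications, and the function name is hypothetical.

import numpy as np
from scipy.stats import gaussian_kde
from sklearn.manifold import MDS

def zircon_age_mds(age_samples, grid_size=512):
    """age_samples: list of 1D arrays of U-Pb ages (My), one per rock sample.
    Returns approximate Aitchison distances and 2D MDS coordinates."""
    logs = [np.log(np.asarray(a, dtype=float)) for a in age_samples]
    lo = min(x.min() for x in logs)
    hi = max(x.max() for x in logs)
    t = np.linspace(lo, hi, grid_size)
    dt = t[1] - t[0]

    # (A) kernel density estimate of each sample on the common log-age grid
    dens = np.array([gaussian_kde(x)(t) for x in logs])
    dens = np.clip(dens, 1e-12, None)

    # (B) reference (weighting) density: geometric mean of the sample densities
    w = np.exp(np.log(dens).mean(axis=0))
    w /= (w * dt).sum()

    # (E) weighted centered log-ratio of each density w.r.t. the reference
    logd = np.log(dens)
    clr = logd - (logd * w * dt).sum(axis=1, keepdims=True)

    # (F) Aitchison distances as weighted L2 distances between the clr's
    diff = clr[:, None, :] - clr[None, :, :]
    D = np.sqrt(((diff ** 2) * w * dt).sum(axis=2))

    # (G) 2D multidimensional scaling of the distance matrix
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(D)
    return D, coords

# toy usage with three synthetic samples of ages (My)
gen = np.random.default_rng(0)
samples = [np.exp(gen.normal(np.log(500), 0.4, size=120)) for _ in range(3)]
D, coords = zircon_age_mds(samples)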

Víctor Velasco-Pardo

Title: Compositional receptor models for mutational signature analysis of cancer sequencing data

Authors: Víctor Velasco-Pardo, Michail Papathomas and Andy G. Lynch

Keywords: mutational signatures, cancer genomics, bayesian, compositional, bioinformatics

Abstract: Cancer is a disease driven and characterised by mutations in the DNA. Most somatic (non-inherited) mutations present in a cancer genome are “passengers” that are not involved in tumour development but bear the fingerprints of the mutational processes that have been operative over a patient’s lifetime. Those fingerprints, termed “mutational signatures”, appear consistently across cancer genomes. Mathematically, they are probability mass functions that can be estimated in what is called a “de novo” mutational signature analysis. Relying on point estimates for those signatures, one can perform a “refitting” analysis to estimate a vector of weights characterising the relative prevalence of each signature in an individual cancer genome. In this talk, we consider the intermediate setting where partial information about the signatures is available, but they are not known precisely. We present a fully Bayesian method for mutational signature analysis based on compositional receptor modelling that allows one to update prior beliefs about the signatures and about the vector of weights characterising a cancer genome. The method was implemented in Stan with an interface to R and is available to researchers along with visualisation tools.
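As a schematic sketch of the kind of model involved (not the exact formulation of the talk), a refitting analysis for a single genome with mutation counts \(m\) over the mutation categories can be written as

\[
m \;\sim\; \mathrm{Multinomial}\!\left(N,\; \textstyle\sum_{k=1}^{K} w_k\, s_k\right),
\qquad
w = (w_1,\dots,w_K) \in \Delta^{K-1},
\]

where each signature \(s_k\) is a probability mass function over the categories and \(w\) the vector of weights; in the fully Bayesian setting considered here, prior distributions are placed both on \(w\) and, when the signatures are only partially known, on the \(s_k\) themselves, with the posterior explored in Stan.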

Jordi Saperas-Riera

Title: LASSO Regression with L1-CoDa Norm

Authors: Jordi Saperas-Riera, Glòria Mateu-Figueras and Josep Antoni Martín-Fernández

Keywords: Aitchison’s geometry, Compositional Data, Lp-norm, Balance selection

Abstract: LASSO regression methods use a penalty function expressed as a norm on the space of the model’s coefficients. This approach reduces the model’s complexity and helps prevent overfitting. Through the regularization process, LASSO can assign zero values to certain coefficients, thereby facilitating model interpretation and highlighting the most relevant variables. An additional challenge arises in models that incorporate compositional covariates: the penalisation norm must align with Aitchison’s geometry. This geometry is essential in compositional data analysis, as such data are subject to a constant-sum constraint.

This contribution focuses on exploring the L1-CoDa norm as a penalty term in LASSO regression within a compositional context. The L1-CoDa norm is distinguished by its ability to select pairs of logratios between parts, which is crucial for identifying and analyzing significant relationships among the variables. Therefore, a rigorous definition of the L1-CoDa norm is presented, taking into account the unique geometric structure of the compositional sample space.
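Schematically, and under the usual log-contrast formulation of regression with a compositional covariate (a sketch only; the precise definition of the L1-CoDa norm is the subject of this contribution and is not reproduced here), the penalized problem has the form

\[
\hat{\beta} \;=\; \arg\min_{\beta}\; \frac{1}{2n}\,\lVert y - \log(X)\,\beta \rVert_2^{2} \;+\; \lambda\, \lVert \beta \rVert_{\mathrm{CoDa},1},
\qquad \textstyle\sum_{j=1}^{D} \beta_j = 0,
\]

where the zero-sum constraint keeps the model compatible with the constant-sum constraint of the composition and the penalty norm is chosen to be compatible with Aitchison’s geometry.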

The application of the LASSO regularization technique with the L1-CoDa norm is illustrated using a real dataset on physical activity. In particular, the relationship within the 24-hour composition is investigated, divided into four parts: sedentary time (S), light physical activity (L), moderate-vigorous physical activity (MV), and sleep (Sp). The continuous response variable will be the z-body mass index (zBMI). This application demonstrates how the proposed method facilitates differentiation between the pairwise logratios that influence the response variable and those that are irrelevant. Additionally, the model is used to infer changes in zBMI when reallocating 15 minutes of time.

Balance selection complements variable selection in the analysis of compositional data. While variable selection simplifies models by eliminating redundant variables, balance selection shifts the focus to the relative information among components, strategically pruning pairwise logratios and balances that have minimal influence on the response variable. This process reveals a subcompositional structure that demonstrates internal independence concerning the explained variable. Balance selection is particularly interesting in cases where the explanatory variable consists of few parts or when the loss of variables is not relevant. Moreover, as proposed in this case, balance selection can estimate a linear model where the full balance plays a relevant role, something that cannot be achieved solely through conventional variable selection.

Lyvia Biagi

Title: Exploring a New Visualization Approach for Glucose Management: Day Categories via Compositional Data

Authors: Lyvia Biagi, Arthur Bertachi, Alvis Cabrera, Júlia Soler, Josep Antoni Martín-Fernández and Josep Vehi

Keywords: Compositional Data (CoDa), Glucose Management, Type 1 Diabetes, Continuous Glucose Monitoring (CGM).

Abstract: Managing insulin dosing in type 1 diabetes (T1D) requires personalized adjustments due to factors like carbohydrate intake, body weight, and glucose variability influenced by circadian rhythms and dietary intake. Despite advances in continuous glucose monitoring (CGM) and continuous subcutaneous insulin infusion (CSII), intra-patient variability complicates dosing adjustments. Integrating clinical data into decision support systems (DSS) may enhance physician-driven adjustments, supporting individualized treatment planning. A key metric for tracking glycemic control is the time spent in, above, or below target glucose ranges. Increasing time within target ranges can improve glycemic stability. CGM data, commonly divided into five glucose ranges, can be effectively analyzed using compositional data (CoDa) methods to manage the relative proportions of time in each range. This study explores whether CoDa, by categorizing days based on the relative time spent in each range at a population level, can provide additional clinical insights for therapy adjustments.

We analyzed CGM data from a simulated cohort of 49 patients over 180 days, using three analytical approaches. First, aggregate time analysis calculated the total time each patient spent within specific glucose ranges over the 180 days, with the arithmetic mean representing an average day as a baseline. Second, a CoDa-based geometric mean analysis was applied to capture the geometric mean of each patient’s 180 daily profiles, assessing daily variability more effectively than arithmetic means. Finally, a population-based categorization classified days into categories aligned with the cohort’s geometric center, enabling the exploration of daily patterns across patients and providing additional insights when combined with traditional metrics.

Results showed that geometric mean analysis better captured daily variability compared to arithmetic means. Population-based categorization added a new dimension, identifying patterns of acceptable and improvable days per patient. This dual approach revealed proportions of days meeting or exceeding glycemic targets, offering clinicians a more comprehensive view of individual patient patterns. The integration of day categories with geometric and arithmetic mean metrics highlighted shared and unique patterns, suggesting population-based insights can complement individualized adjustments. While aggregate and geometric mean approaches provided different perspectives on glycemic control, day categorization facilitated visualization of variability within and across patients, supporting more precise insulin adjustments.

These findings suggest that CoDa-based analysis, particularly geometric means, offers a valuable alternative to conventional time-in-range metrics, enabling better detection of daily glycemic variability. Population-based categorization adds clinically meaningful insights but may benefit from further refinement to address unique patient risks, such as hypo- or hyperglycemia. Future studies involving real patient data will validate these findings and explore how lifestyle factors, such as diet and physical activity, influence CoDa-derived measures. Incorporating these approaches into DSS tools could enhance therapy precision and glycemic control in T1D management.
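As a small illustration of the arithmetic-versus-geometric contrast described above (a generic sketch with simulated proportions, not the study's categorization procedure), the CoDa centre of a set of daily time-in-range compositions is the re-closed component-wise geometric mean:

import numpy as np

def closure(x):
    """Rescale each row so that the time-in-range proportions sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=1, keepdims=True)

def coda_centre(days):
    """Component-wise geometric mean of daily compositions, re-closed.
    Zeros (days with no time in a range) would need imputation beforehand."""
    days = closure(days)
    g = np.exp(np.log(days).mean(axis=0))
    return g / g.sum()

# hypothetical 180 daily compositions over five glucose ranges
gen = np.random.default_rng(0)
days = gen.dirichlet([1.0, 3.0, 10.0, 3.0, 1.0], size=180)

arithmetic_day = days.mean(axis=0)   # "average day" baseline
geometric_day = coda_centre(days)    # CoDa geometric-mean day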

Mia Hubert

Title: Robust principal components by casewise and cellwise weighting

Authors: Mia Hubert

Abstract: Principal component analysis (PCA) is a fundamental tool for analyzing multivariate data. Here the focus is on dimension reduction to the principal subspace, characterized by its projection matrix. The classical principal subspace can be strongly affected by the presence of outliers. Traditional robust approaches consider casewise outliers, that is, cases generated by an unspecified outlier distribution that differs from that of the clean cases. But there may also be cellwise outliers, which are suspicious entries that can occur anywhere in the data matrix. Another common issue is that some cells may be missing. We propose a new robust PCA method, called cellPCA, that can simultaneously deal with casewise outliers, cellwise outliers, and missing cells. Its single objective function combines two robust loss functions, which together mitigate the effect of casewise and cellwise outliers. The objective function is minimized by an iteratively reweighted least squares (IRLS) algorithm. Residual cellmaps and enhanced outlier maps are proposed for outlier detection. We also illustrate how the approach can be extended towards tensor data.
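As a very rough sketch of this type of formulation (a generic cellwise-robust objective, not necessarily the cellPCA objective itself), one can think of fitting a rank-\(k\) subspace by minimizing

\[
\sum_{i=1}^{n}\sum_{j=1}^{p} \rho\!\left( \frac{x_{ij} - \mu_j - p_j^{\top} t_i}{\hat{\sigma}_j} \right)
\]

over the centre \(\mu\), orthonormal loadings \(p_1,\dots,p_p\) and scores \(t_i\), with \(\rho\) a bounded loss applied to each cell; cellPCA combines a cellwise and a casewise robust loss in a single objective and minimizes it by IRLS, so that each iteration reduces to a weighted least-squares PCA fit.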