Research scope

Compositional data analysis

The core research line of the group is the statistical analysis of compositional data, defined as (random) vectors of positive components whose sum is constant (e.g., one, 100, or a million). These restrictions cause standard statistical techniques to lose their applicability and their classical interpretation. Although this problem has been known since the nineteenth century, no theoretically founded solution was proposed until the 1980s, when Professor J. Aitchison gave, from a strictly statistical perspective, the first principles for consistently analysing compositional data. Starting from these first indications, it was later found that the mathematical foundation of this statistical approach rests on the definition of a specific geometry on the simplex (the support space of compositional data), from which any classical statistical analysis (cluster analysis, discriminant analysis, factor analysis, regression models, etc.) can be rigorously developed. This implies that our research goals include both the development of proper statistical techniques for compositional data and the understanding of the mathematics underlying these techniques, which belong to the fields of algebra and geometry, measure theory, and differential and integral calculus on the simplex. Obviously, this reasoning can also be applied to other support spaces, such as R+, R2+, the (0,1) interval and many others, thus offering promising new ways to analyse constrained data sets.
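The geometry mentioned above rests on two operations, perturbation and powering, together with a metric. A minimal NumPy sketch (the function names are our own, for illustration only, not taken from any particular package) could look like this:

```python
import numpy as np

def closure(x):
    """Rescale a vector of positive parts so that it sums to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def perturb(x, y):
    """Perturbation: the 'addition' of the simplex geometry."""
    return closure(np.asarray(x, dtype=float) * np.asarray(y, dtype=float))

def power(x, a):
    """Powering: the 'scalar multiplication' of the simplex geometry."""
    return closure(np.asarray(x, dtype=float) ** a)

def clr(x):
    """Centred log-ratio transform, an isometry onto a hyperplane of R^D."""
    lx = np.log(closure(x))
    return lx - lx.mean()

def aitchison_distance(x, y):
    """Metric induced by the Euclidean structure of the simplex."""
    return np.linalg.norm(clr(x) - clr(y))
```

With these operations the simplex behaves as a vector space: the uniform composition acts as the neutral element of perturbation, and powering by 1 leaves a composition unchanged.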

In Geology, Petrology, Chemistry, Economics, Archaeometry, etc., one usually works with vectors of data whose components represent the relative contribution of different parts to a whole. The main goal of the group is to advance the statistical analysis of compositional data and its mathematical foundation. This general goal currently comprises the following specific objectives:

1. Mathematical foundations of compositional data analysis. After defining compositions as equivalence classes, the whole methodology developed on the simplex can be applied to the quotient space of compositions, and even broadened. In this way, these techniques are precisely and rigorously founded, while classical concepts of geometry, measure theory, and differential and integral calculus must be translated to, and interpreted on, the compositional quotient space, or equivalently, on the simplex.
2. Orthogonality and independence on the simplex. Since the simplex equipped with a proper metric becomes a Euclidean space, orthogonal and orthonormal bases can be defined on it, together with the associated isometric log-ratio transformations. Hence we propose to study subcompositional independence as closely related to orthogonality of subspaces of the simplex.
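The isometric log-ratio (ilr) transformation can be sketched for one particular orthonormal (balance) basis; this is a minimal illustration, not the implementation used by the group:

```python
import numpy as np

def ilr(x):
    """Isometric log-ratio coordinates of a D-part composition with
    respect to one sequential-balance orthonormal basis; the map sends
    the simplex isometrically onto R^(D-1)."""
    x = np.asarray(x, dtype=float)
    D = len(x)
    z = np.empty(D - 1)
    for i in range(1, D):
        # balance between the geometric mean of the first i parts and part i+1
        g = np.exp(np.log(x[:i]).mean())
        z[i - 1] = np.sqrt(i / (i + 1.0)) * np.log(g / x[i])
    return z
```

Note that the coordinates depend only on ratios of parts, so they are invariant under rescaling of the composition, and the uniform composition maps to the origin of R^(D-1).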
3. Parametric cluster analysis of compositional data. In recent years we have dealt with non-parametric classification techniques for compositional data, essentially based on the compositional distance introduced by Aitchison. We now tackle parametric classification methodologies for compositional data, based on the hypothesis that the clusters are samples generated from distributions of the additive logistic normal (aln) class. However, geochemical compositional data usually present zero components, which are in fact small non-zero values lying below the detection limit of the measurement instruments. These zero components must therefore be replaced by non-zero values before any classification procedure can be applied. We are currently applying the multiplicative substitution methodology (introduced by J.A. Martín, a member of the group) to the parametric classification techniques.
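The multiplicative substitution idea can be sketched as follows; this is a minimal illustration with a single scalar detection limit delta, not the group's full methodology:

```python
import numpy as np

def multiplicative_replacement(x, delta):
    """Replace zero parts (values below detection) by a small value delta,
    shrinking the observed parts multiplicatively so that the total of the
    composition is preserved and ratios between observed parts are unchanged."""
    x = np.asarray(x, dtype=float)
    zero = (x == 0)
    total = x.sum()
    r = np.where(zero, delta, x)
    # multiplicative shrinkage of the observed parts keeps their ratios intact
    r[~zero] = x[~zero] * (1.0 - delta * zero.sum() / total)
    return r
```

Because the adjustment of the observed parts is multiplicative rather than additive, the covariance structure of the log-ratios among observed parts is not distorted, which is the key property for subsequent compositional analysis.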
4. Additive logistic skew normal distribution (alsn). We have recently completed the study and modelling of compositional data with the skew normal distribution, introduced by Azzalini (1996), by using the same log-ratio transformation technique originally applied by Aitchison to the normal distribution, complemented with some results from measure theory. We thus introduced the alsn distribution and studied its properties, especially those related to the real vector space structure of the simplex and to subcompositions. This distribution family offers a promising remedy for one of the shortcomings of additive logistic normal distributions: the fact that two (or more) amalgamated aln components are seldom aln-distributed.
5. Goodness-of-fit tables for skew normal distributions. Since the skew normal distribution was introduced so recently, there are still no statistical tools to test whether these skewed models fit real data sets reasonably well, especially against the classical symmetric normal model. This has led the group to compute specific tables for this goodness of fit, which are currently being developed following the methodology suggested by Stephens, for different sample sizes and significance levels.
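The Monte Carlo construction of such tables can be sketched in the simplest setting, a fully specified normal null with an Anderson-Darling type EDF statistic (one of the statistics treated by Stephens); the group's actual tables target the skew normal case, so this is purely illustrative:

```python
import numpy as np
from math import erf, sqrt

def phi(v):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(v / sqrt(2.0)))

def anderson_darling(sample):
    """A^2 statistic of a sample against the standard normal CDF."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    F = np.array([phi(v) for v in x])
    i = np.arange(1, n + 1)
    return -n - np.mean((2 * i - 1) * (np.log(F) + np.log(1.0 - F[::-1])))

def critical_values(n, alphas, reps=2000, seed=0):
    """Monte Carlo critical values of A^2 for samples of size n:
    simulate under the null, sort the statistics, read off quantiles."""
    rng = np.random.default_rng(seed)
    stats = np.sort([anderson_darling(rng.standard_normal(n))
                     for _ in range(reps)])
    return {a: stats[int((1.0 - a) * reps) - 1] for a in alphas}
```

Tabulating `critical_values(n, [0.10, 0.05, 0.01])` over a grid of sample sizes n yields exactly the kind of table described above.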
6. Statistical analysis of spatially dependent compositional data sets. It is rather usual in Geostatistics (from mining to environmental studies) to deal with compositional data showing spatial dependence. Until now, the standard cokriging techniques used in this framework have been based on an extension of the transformation techniques suggested by Aitchison to the spatial case, but without considering the vector space structure of the simplex. Therefore, we plan to reformulate these techniques taking into account the Euclidean metric introduced on the simplex, which we expect will solve some flaws detected in classical applications of cokriging to log-ratio-transformed variables.
7. Linear and non-linear models on the simplex. The most recent advances in building an algebraic-geometric structure on the simplex suggest the need to reinterpret, from a compositional point of view, the modelling of linear and non-linear processes as compositional processes. We have begun with some real case studies in a geological setting, producing very promising results.
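The simplest such model, a compositional line, replaces the Euclidean x0 + t*v by perturbation and powering. A minimal sketch with a hypothetical 3-part composition:

```python
import numpy as np

def closure(x):
    """Rescale a vector of positive parts so that it sums to 1."""
    return np.asarray(x, dtype=float) / np.sum(x)

def compositional_line(x0, p, t):
    """Linear process on the simplex: x0 perturbed by the t-th power of p,
    the compositional analogue of the straight line x0 + t*v."""
    return closure(np.asarray(x0, dtype=float) * np.asarray(p, dtype=float) ** t)

# hypothetical example: a 3-part composition drifting along a compositional line
x0 = closure([0.2, 0.3, 0.5])
p = closure([1.0, 1.1, 0.9])
path = np.array([compositional_line(x0, p, t) for t in np.linspace(0.0, 10.0, 5)])
```

Every point of the path stays inside the simplex by construction, which is precisely what naive linear modelling of the raw components fails to guarantee.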
8. Compositional software (CoDaPack). Since the beginning of the twenty-first century, the research group has developed a package called CoDaPack, containing a set of routines aimed at users with minimal computing knowledge, designed to be simple and easy to use. Users interact with the package through menus, and it returns both numerical and graphical outputs. Graphical outputs can be rendered in 3D and can be zoomed and rotated.
Originally, CoDaPack was tied to Excel by means of Visual Basic routines, so that it ran as a menu within Excel and its results were also placed in Excel sheets. Later, without leaving Excel, the graphics were improved by means of OpenGL.
Since May 2011 there is a new version, CoDaPack 2.0, which does not depend on Excel. This version is programmed in Java and only requires the Java virtual machine (minimum version 1.5) to be installed. This allows CoDaPack 2.0 to run on any computer with a Java Virtual Machine, whatever its operating system; in particular, the family of Mac computers from Apple and Unix operating systems can now run CoDaPack 2.0.
The package is constantly expanding with new routines and improvements to existing ones.

(soon)