Metabolomics Normalization and Feature Selection

Published

22 Oct 2025

Load Libraries

Code

##################
# LOAD LIBRARIES #
##################
library(tidyverse)
library(readxl)

# BiocManager::install("limma")
library(limma)
library(mixOmics) 
library(edgeR)

library(cowplot)
library(RColorBrewer)
library(scales)
select <- dplyr::select
map <- purrr::map

# setwd("/Users/vilkal/work/Brolidens_work/Projects/Candida-omics/src")
#################
# COLOUR PALLET #
#################
pal <- c("#902267", "#ffa998", "#f588af") # #e05e85

This is a good ressource to read about PLS-DA https://divingintogeneticsandgenomics.com/post/partial-least-square-regression-for-marker-gene-identification-in-scrnaseq-data/

Overview:

Feature Selection Workflow for Multi-Omics Integration

Data QC & Preprocessing (per omics)
Clean, normalize, and scale each omics dataset independently.
Exploratory Supervised Check with PLS-DA (optional)
Use PLS-DA to confirm that class separation exists before applying sparsity.
Tune Sparse Model Parameters with tune.splsda()
Identify the optimal number of components (ncomp) and variables to select per component (keepX) via cross-validation.
Fit Final sPLS-DA Model
Build the sPLS-DA model using the tuned parameters.
Assess Model Performance & Stability with perf()
Evaluate classification accuracy, balanced error rate (BER), and stability of selected variables.
Extract Selected Features (per component and per omics)
Retrieve the most discriminant variables identified by the sPLS-DA model.
Save Feature Lists for Downstream Multi-Omics Integration
Use the selected features as input for integration frameworks such as DIABLO (block.splsda()).

1. Data QC & Preprocessing

Code

#############
# LOAD DATA #
#############
meta_path <- "/Users/vilkal/Downloads/Metadata_svamp.xlsx"
meta_data <- read_xlsx(meta_path, sheet = "Metadata", skip = 1)

matrix_data <- read_csv("../Tidy_data/Metabolomics/Metabolom_matrix.csv")
metadata <- read_csv("../Tidy_data/Metabolomics/metabolomics_metadata.csv")
annotation_table <- read_csv("../Tidy_data/Metabolomics/metabolomics_annotation.csv")

Cleaning up data

The metabolomic data was provided in a unformatted table, so compiling the data into appropriate format is required. Differences in sample number, metabolite nomenclature, etc.

Code

RAW <- as.data.frame(RAW)
head(colnames(RAW),30); head(RAW[,1],30)

# EXTRACT INFO
N <- grep("order",colnames(RAW)); M <- grep("1",RAW[,1])[1]-1
md <- RAW[(M+1):nrow(RAW),1:N]
an <- RAW[1:M,(N+1):ncol(RAW)]
M <- RAW[(M+1):nrow(RAW),(N+1):ncol(RAW)]

# HARMONIZE COMPOUND ANNOTATIONS
an <- rbind(an,colnames(an))
colnames(an) <- gsub(",",".",paste0("GCMS-",signif(as.numeric(an[6,],2)),"@",
    signif(as.numeric(an[5,],2))))
ann <- t(rbind(name=an[17,], HMDB=an[2,], KEGG=an[3,], PUBCHEM=an[4,], 
    Mz=an[6,], Rt = an[5,] , row.names = colnames(an) ))
colnames(an) <- an[17,]

# HARMONIZE SAMPLE NAMES
colnames(M) <- colnames(an)
rownames(md) <- md[,"svamp_ID"]
rownames(M) <- md[,"svamp_ID"]

# CONVERT TO MATRIX
M <- t(matrix(as.numeric(as.matrix(M)),nrow(M),ncol(M),dimnames = dimnames(M)))

# INCORPORATE EXTRA METADATA
pmd <- as.data.frame(readxl::read_excel("../data/metabolomics/svamp_ID_list.xlsx"))
md <- cbind(md, pmd[match(md$svamp_ID,md$svamp_ID),c("group_cross","group_longitudinal", "spatial_code")])
md[is.na(md)] <- ""

# SAVE OUTPUT
write.csv(M, "../results/metabolomics_matrix.csv", row.names=T)
write.csv(md, "../results/metabolomics_metadata.csv", row.names=T)
write.csv(ann, "../results/metabolomics_annotation.csv", row.names=T)

We start by removing reference compunds and quality controll samples, before log10 and quantile normalising

Filter samples

###############
# FILTER DATA #
###############
data <- matrix_data %>%
  # remove reference compunds
  filter(!grepl("Standard",.$Compund)) %>%
  # remove controll samples
  select(-starts_with("QC"))

# transforme to matrix
matrix <- data %>%
  column_to_rownames(var = "Compund") %>%
  as.matrix(.)

Normalize

# Normalize data
##################
# NORMALIZE DATA #
##################
# samples should be columns and features rows
norm <- normalizeQuantiles(log1p(matrix))

# Transpose for mixOmic analysis
matrix.t <- matrix %>%
  t()

matrix.t[1:5,1:4]

    Lactic acid Glycolic acid Alanine alpha-Hydroxybutyric acid
S01     78.1233       28.3764 37.4065                    0.1534
S02     79.0192       17.6899 55.2835                    2.4247
S03     72.6377       21.7367 31.1321                    0.0474
S04     52.0407       33.4765 27.9321                    0.6472
S05     53.3506       40.8274 29.9669                    0.2101

NB! the majority(?) of mixOmics functions expect a format of samples as rows and features as columns, thus the transformation is VERY important!

########################
# SAVE NORMALIZED DATA #
########################
norm %>% as_tibble(., rownames = "svamp_ID") %>%
  write_csv(., "../Results/MixOmic/Metabolom_Normalized.csv")
# norm <- read_csv("../Results/MixOmic/Metabolom_Normalized.csv")

Get group assignment

gr <- c("0"="ctrl", "1"="ctrl", "2"="int", "3"="int", "4"="C.a", "5"="C.a")

gr_data <- meta_data %>%
    rename(`Symptom score (0-5)` = "Symptom score (0-5) - ibland har de svarat olika skriftligt i frågeformuläret och muntligt vid inklusiion, då har jag valt den högsta scoren") %>%
  select(
    svamp_ID,
    `Clinical score (0-5)`,
    `Fungal culture: C. albicans (y/n)`,
    `Fungal culture: Non-albicans candida spp. (y/n)`,
    `Symptom score (0-5)`,
    `Recurring fungal infections > 2/year (y/n)`
  ) %>%
    mutate(
  group = case_when(
    `Fungal culture: C. albicans (y/n)` == "1" &
      `Symptom score (0-5)` >= 1 &
      `Recurring fungal infections > 2/year (y/n)` == "1" ~
      "RVVCpos",

    `Fungal culture: C. albicans (y/n)` == "0" &
      `Recurring fungal infections > 2/year (y/n)` == "1" ~
      "RVVCneg",

    `Fungal culture: C. albicans (y/n)` == "1" &
      `Symptom score (0-5)` == 0 ~
      "AS",

    `Fungal culture: C. albicans (y/n)` == "0" &
      `Recurring fungal infections > 2/year (y/n)` == "0" ~
      "Control",

    `Fungal culture: C. albicans (y/n)` == "1" &
      `Recurring fungal infections > 2/year (y/n)` == "0" ~
      "Candidapos",

    TRUE ~ NA_character_
  )
) %>%
  select("svamp_ID", group, everything())

gr <- tibble(ID = colnames(data)[-c(1)]) %>% 
    left_join(., gr_data, by =c("ID"="svamp_ID")) %>%
    mutate(
      pos = case_when(
        `Fungal culture: C. albicans (y/n)` == "1" |
        `Fungal culture: Non-albicans candida spp. (y/n)` == 1 ~
        "pos",
        TRUE ~ "neg"
        ), .after="group") %>%
    mutate(Clin_gr = gr[.$`Clinical score (0-5)`], .after ="group") %>%
    filter(!(is.na(.$Clin_gr)))

Check the distribution of samples per groups

table(gr$`Clinical score (0-5)`)


 0  1  2  3  4  5 
84 18 17 10  8  5

table(gr$Clin_gr)


 C.a ctrl  int 
  13  102   27

Remove samples with missing Clinical scores

matrix.t <- matrix.t[gr$ID,]
dim(matrix.t)

[1] 142  90

2. Exploratory Supervised Check with PLS-DA

Principal Component Analysis

pca.temp <- pca(matrix.t, ncomp = 3, scale = TRUE, center = TRUE)

plotIndiv(pca.temp, col = pal, group = gr$Clin_gr, ind.names = FALSE,
          legend = TRUE, size.title = rel(1),
          title = 'Metabolites, PCA comp 1-2')

Figure 1: Preliminary (unsupervised) analysis with PCA on the candida vaginal swab data. Samples are projected into the space spanned by the principal components 1 and 2. The culture positive samples are not clustered, meaning that the major source of variation cannot be explained by culture positve/negativ groups. Instead, samples seem to cluster according to an unknown source of variation.

# ggplot
# pca.temp$variates$X %>% 
#   as_tibble(., rownames = "samples") %>%
#   mutate(Candida= gr$Clin_gr) %>%
  
#   ggplot(aes(x=PC1, y = PC2)) +
#   geom_point(aes(color = Candida, shape = Candida), alpha = 0.7) +
#   theme_bw(base_size = 10) + coord_fixed(ratio = 1) +
#   labs(title = 'Metabolites, PCA comp 1-2') +
#   theme(legend.position = 'right', 
#         plot.margin = unit(c(0.1,0,0,0), "cm"),
#         legend.key.size = unit(0.4, "cm"),
#         #legend.spacing.x = unit(0.4, 'cm')
#         )

Although there is no clear seperation we can already now see that the C.a samples are more spread out along the x-axis compared to the controll groups

We observe almost no separation between the groupes defined by candida culturing in the PCA sample plot. This preliminary exploration teaches us two important findings:

The major source of variation is not attributable to Candida, but an unknown source (we tend to observe clusters of samples, but those are not explained by Candida groups).
A more directed (supervised) analysis is needed to separate the Candida groups, and we should expect that the amount of variance explained by the dimensions in the PLS-DA analysis will be small.

PLS-DA

# check that samples (rows) and groups are in agreement
dim(matrix.t)

[1] 142  90

length(gr$Clin_gr)

[1] 142

# look at the distribution of samples per group
# this is important to know what metric to look at (Overall/BER)
table(gr$Clin_gr)


 C.a ctrl  int 
  13  102   27

# run preliminary model
basic.plsda <- plsda(matrix.t, gr$Clin_gr, logratio = 'none', 
                          ncomp = 10)
# plot the explained variance per component
fig <- plotIndiv(basic.plsda, col = pal,
          comp = c(1,2),
          group = gr$Clin_gr,
          size.title = 8, ind.names = FALSE,
          legend = T,
          title = "PLS-DA Components 1 & 2")$graph

Figure 2: PCA seperation of groups based on untuned model with 10 principal components

# fig + scale_colour_discrete(name = "My Legend", labels = c("Label1", "Label2"))
# print(fig)

Interpretation of PCA vs PLS-DA

Normaly we would be able to observe improved clustering according to groups in the PLS-DA plot compared with PCA. Such an improvemnt is expected, since PLS-DA incorporates class information for each sample, unlike PCA.

From the plotIndiv() output, the axis labels indicate the amount of variation explained per component.
However, the interpretation of this value differs from PCA:
in PLS-DA, the model aims to maximise the covariance between components associated with X and Y, rather than maximising the variance in X alone.

still we did not se much of an improvemnt between the two plots, which could indicate that the metabolome is not very corrleated with clinincal candida groups.

3. Tune Sparse Model

Tuning of the model

🔧 `tune.splsda()`

What it does

Performs cross-validation to find the optimal model parameters for your sPLS-DA model.

Specifically, it tunes:

The number of components (ncomp)
The number of variables to select on each component (keepX)

It tests multiple combinations of these parameters and identifies the best-performing model based on classification error — often using the Balanced Error Rate (BER) as the performance metric.

Number of variables to select

We first define a grid of keepX values. For example here, we define a fine grid at the start, and then specify a coarser, larger sequence of values:

# Grid of possible keepX values that will be tested for each comp
list.keepX <- c(1:10,  seq(20, 100, 10))
list.keepX

 [1]   1   2   3   4   5   6   7   8   9  10  20  30  40  50  60  70  80  90 100

# This chunk takes ~ 2 min to run
# Some convergence issues may arise but it is ok as this is run on CV folds
set.seed(30) # For reproducibility with this notebook, remove otherwise

tune.splsda <- tune.splsda(matrix.t, gr$Clin_gr, ncomp = 5, validation = 'Mfold', 
                           folds = 5, dist = 'centroids.dist', 
                           test.keepX = list.keepX, nrepeat = 10)

The following command line will output the mean error rate for each component and each tested keepX value given the past (tuned) components.

# Just a head of the classification error rate per keepX (in rows) and comp
head(tune.splsda$error.rate)

      comp1     comp2     comp3     comp4     comp5
1 0.5110022 0.4895397 0.4953494 0.4913915 0.4908832
2 0.5120356 0.4884867 0.4937881 0.4928803 0.4875063
3 0.5294900 0.4884141 0.4943690 0.4919725 0.4875063
4 0.5397157 0.4876878 0.4946958 0.4916457 0.4859449
5 0.5448439 0.4885956 0.4975644 0.4926261 0.4859449
6 0.5466594 0.4898302 0.5000335 0.4938607 0.4859449

# To show the error bars across the repeats:
plot(tune.splsda, sd = TRUE, group = gr$Clin_gr)

Figure 4: Stability of variable selection from the sPLS-DA on the metabol data. We use a by-product from perf() to assess how often the same variables are selected for a given keepX value in the final sPLS-DA model. The barplot represents the frequency of selection across repeated CV folds for each selected gene for component 1 and 2. The genes are ranked according to decreasing frequency.

In the case that the first component already explains nearly all discriminatory signal, and the algorithm finds that adding another one doesn’t improve performance (i.e. error rate stays flat) one component will be selected.

However, two components are required to visualize the samples in a 2D space (Comp 1 vs Comp 2), for this reason the minimum number of compnent to choose are set to 2.

# The optimal number of components according to our one-sided t-tests
tune.splsda$choice.ncomp$ncomp

[1] 1

if(tune.splsda$choice.ncomp$ncomp == 1){ncomp <- 2}else{ncomp <- tune.splsda$choice.ncomp$ncomp}

# The optimal keepX parameter according to minimal error rate
tune.splsda$choice.keepX

comp1 comp2 comp3 comp4 comp5 
   80     4     2    60    30

4. Fit Final sPLS-DA Model

Final PLS-DA model

We now run our final PLS-DA model that includes five components:

final.splsda <- splsda(matrix.t, gr$Clin_gr, 
                       ncomp = ncomp, 
                       keepX = tune.splsda$choice.keepX)

keepX is of length 5 while ncomp is 2 
trimming keepX to [33mc(80,4)[39m

🎯 `perf()`

What it does

Evaluates the performance and stability of your final sPLS-DA model using repeated cross-validation.

Unlike tune.splsda(), this function does not tune parameters — it assesses how well your chosen model performs.

It computes metrics such as:

Classification accuracy
Balanced Error Rate (BER)
Stability of selected features across cross-validation folds

The results help you assess the robustness and reliability of your tuned model.

Metric	What it measures	When to use
Overall error rate	The proportion of all samples misclassified (1 − overall accuracy). It treats all samples equally.	Use when your classes are balanced (roughly equal size).
Balanced error rate (BER)	The average of the per-class error rates — i.e., it gives equal weight to each class regardless of class size.	Use when your classes are imbalanced or you want performance independent of class size.

# assess the performance of the sPLS-DA model using repeated CV
perf.splsda = perf(final.splsda,  
                  validation = 'Mfold', 
                  folds = 5, 
                  nrepeat = 50, # we recommend 50 
                  progressBar = FALSE) # TRUE to track progress

# extract the optimal component number
optimal.ncomp <- perf.splsda$choice.ncomp["BER", "max.dist"] 

plot(perf.splsda, overlay = 'measure', sd=TRUE) # plot this tuning

Figure 3: Tuning the number of components in PLS-DA on the metabolome expression data. Each line (solid = overall, dashed = BER) shows classification error as the number of PLS components increases (1 → 10). Each panel represents a different distance metric used in prediction. Bars show the standard deviation across the repeated folds.

For each component, repeated cross-validation (10 × 3-fold CV) is used to evaluate the PLS-DA classification performance — both overall error rate and balanced error rate (BER) — for each type of prediction distance:

Distance metric	Description	Type
max.dist	Usually performs well and is the default in `mixOmics::plsda`.	Linear
centroids.dist	Can be more robust when class separation is moderate.	Nonlinear
mahalanobis.dist	Accounts for within-class covariance — sometimes helps when classes are elliptically distributed.	Nonlinear

🧭 Comparison of `tune.splsda()` and `perf()` in mixOmics

Function	Step in Workflow	Main Purpose	What It Tunes or Evaluates	Key Outputs	When to Use
`tune.splsda()`	Model tuning (Step 1)	Optimize model parameters	• Number of components (`ncomp`) • Number of variables per component (`keepX`)	• `choice.keepX` (optimal variables per component) • `choice.ncomp` (optimal number of components) • `error.rate` across tested configurations	Before building the final model
`perf()`	Model evaluation (Step 2)	Assess predictive performance and feature stability	Uses the final tuned model to compute: • Classification accuracy • Balanced Error Rate (BER) • Feature stability	• `error.rate` (overall and per class) • `BER` • `features$stable` (variable stability)	After tuning, to validate model robustness

5. Assess Model Performance

Number of components in PLS-DA

The perf() function evaluates the performance of PLS-DA - i.e., its ability to rightly classify ‘new’ samples into their group category using repeated cross-validation. We initially choose a large number of components (here ncomp = 10) and assess the model as we gradually increase the number of components. Here we use 3-fold CV repeated 10 times. In Module 2, we provided further guidelines on how to choose the folds and nrepeat parameters:

Retaining only the BER and the centroid.dist, numerical outputs of interest include the final overall performance for 5 components:

perf.splsda$error.rate

$overall
       max.dist centroids.dist mahalanobis.dist
comp1 0.2645070      0.4015493        0.4015493
comp2 0.2630986      0.3884507        0.3539437

$BER
       max.dist centroids.dist mahalanobis.dist
comp1 0.5830590      0.4908759        0.4908759
comp2 0.5725373      0.4871577        0.5292012

perf.splsda$error.rate.class$centroids.dist

         comp1     comp2
C.a  0.4615385 0.4615385
ctrl 0.3162745 0.2954902
int  0.6948148 0.7044444

dist <- c('max.dist', 'centroids.dist') %>% 
  set_names(., c('Max distance', 'Centroids distance'))
background <- imap(dist, ~background.predict(final.splsda, 
                                     comp.predicted = ncomp,
                                     dist = .x) )

iwalk(background, ~plotIndiv(final.splsda, col = pal, comp = 1:2, group = gr$Clin_gr,
          ind.names = FALSE, title = paste0(.y),
          legend = TRUE,  background = .x))

6. Extract Selected Features

Variable selection and stability

During the repeated cross-validation process in perf() we can record how often the same variables are selected across the folds. This information is important to answer the question: How reproducible is my gene signature when the training set is perturbed via cross-validation?.

# Convert named vector to tibble
stable.comp1 <- perf.splsda$features$stable$comp1
stable.comp2 <- perf.splsda$features$stable$comp2

df <- tibble(
  Feature = c(names(stable.comp1), names(stable.comp2)),
  Stability = c(as.numeric(stable.comp1), as.numeric(stable.comp2)),
  comp = c(rep("comp1",length(stable.comp1)), rep("comp2",length(stable.comp2)))
) %>%
  arrange(desc(Stability), .by = "comp") %>%
  slice_head(., n = length(stable.comp2), by = "comp")

# arrange accoroding to value for individal facets
reorder_within <- function(x, by, within, fun = mean, sep = "___", ...) {
    new_x <- paste(x, within, sep = sep)
    stats::reorder(new_x, by, FUN = fun, decreasing = T)
}
scale_x_reordered <- function(..., sep = "___") {
    reg <- paste0(sep, ".+$")
    ggplot2::scale_x_discrete(labels = function(x) gsub(reg, "", x), ...)
}

# Create ggplot bar chart
ggplot(df, aes(x = reorder_within(Feature, Stability, comp), y = Stability)) +
  geom_col(fill = "#f588af") +
  labs(
    x = "Variables selected across CV folds",
    y = "Stability frequency",
    title = "Feature stability"
  ) +
  scale_x_reordered() +
  theme_minimal() + facet_wrap(~comp, scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Figure 5: Stability of variable selection from the sPLS-DA on the metabol data. We use a by-product from perf() to assess how often the same variables are selected for a given keepX value in the final sPLS-DA model. The barplot represents the frequency of selection across repeated CV folds for each selected gene for component 1 and 2. The metabolites are ranked according to decreasing frequency.

plotIndiv(final.splsda, col = pal, ind.names = FALSE, legend=TRUE,
          comp=c(1,2), ellipse = TRUE, 
          title = 'sPLS-DA on SRBCT comp 1-2',
          X.label = 'sPLS-DA comp 1', Y.label = 'sPLS-DA comp 2') 

# plotIndiv(final.splsda, col = pal, ind.names = FALSE, legend=TRUE,
#           comp=c(2,3), ellipse = TRUE, 
#           title = 'sPLS-DA on SRBCT comp 2-3',
#           X.label = 'sPLS-DA comp 2', Y.label = 'sPLS-DA comp 3')

Figure 1: Sample plots from PLS-DA comp 1-2

Let’s see the top metabolites for each of the sPLS components

scale_y_reordered <- function(..., sep = "___") {
    reg <- paste0(sep, ".+$")
    ggplot2::scale_y_discrete(labels = function(x) gsub(reg, "", x), ...)
}

get_feature_group_mapping <- function(comp = "comp1") {
  p_comp <- final.splsda$loadings$X[, comp]   # feature loadings for that component

  # Compute contribution of each feature to each sample’s score
  # (This is essentially X * p_comp, since t = X %*% p)
  contrib <- matrix.t %*% diag(p_comp)
  colnames(contrib) <- colnames(matrix.t)

  # Calc mean contribution per feature to get group
  mean_contrib <- contrib %>% # features × components
    as_tibble(., rownames = "ID") %>%
    mutate(gr = gr$Clin_gr) %>%
    summarise(across(where(is.numeric), mean, na.rm = TRUE), .by="gr") %>%
    pivot_longer(-gr, names_to = "feature", values_to = "mean_contribution") #%>%
    #select(feature, gr) %>% deframe()

    # For each feature, pick the group with the highest mean contribution
  feature_group_map <- mean_contrib %>%
    slice_max(mean_contribution, n = 1, by=feature) %>%
    arrange(desc(abs(mean_contribution))) %>%
    select(feature, group = gr) %>% deframe() }

gr_map <- c("comp1","comp2") %>% 
  set_names() %>%
  imap(., ~get_feature_group_mapping(.x))

Warning: There was 1 warning in `summarise()`.
ℹ In argument: `across(where(is.numeric), mean, na.rm = TRUE)`.
ℹ In group 1: `gr = "ctrl"`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))

final.splsda$loadings$X %>%
  as_tibble(., rownames = "Metabolite") %>%
  pivot_longer(cols = starts_with("comp"),
               names_to = "component",
               values_to = "value") %>%
  #group_by(component) %>%
  slice_max(abs(value), n = 20, by = component, with_ties = FALSE) %>%
  mutate(Metabolite = ifelse(value == 0, strrep(" ", row_number()), Metabolite)) %>%
  mutate(Reg = 
        ifelse(value > 0 , "UP", "DOWN")) %>%
  mutate(Metabolite = reorder_within(as.character(Metabolite), value, component)) %>%
  #mutate(gr = gr_map[ Metabolite]) %>%
  rowwise() %>%  # allows per-row lookup
  mutate(gr = gr_map[[component]][[Metabolite]]) %>%
  ungroup() %>%


# plot
  ggplot(aes(value, Metabolite, fill = factor(Reg))) +
  geom_col(show.legend = FALSE, width = 0.9) +
  scale_fill_manual(values = pal) +
  facet_wrap(~component, scale='free') +
  scale_y_reordered() +
  labs(y = NULL)

Figure 5: Variable importance in projection (VIP) coefficients reflect the relative importance of each X variable for each X variate in the prediction model. VIP allows to classify the X-variables according to their explanatory power of Y. Predictors with large VIP, larger than 1, are the most relevant for explaining Y.

# reorder_within(Metabolite, value, component)
# fct_reorder(Metabolite, value)

A VIP score is a measure of a variable’s importance in the PLS-DA model. It summarizes the contribution a variable makes to the model. The VIP score of a variable is calculated as a weighted sum of the squared correlations between the PLS-DA components and the original variable.

splsda.vip <- vip(final.splsda)

# Convert to tidy (long) format for ggplot
vip_long <- splsda.vip %>%
  as_tibble(., rownames = "Metabolite") %>%
  pivot_longer(cols = starts_with("comp"),
               names_to = "Component",
               values_to = "VIP")

# Find top 5 metabolites (by mean VIP for each components)
top_metabolites <- vip_long %>%
  mutate(meanVIP = mean(VIP), .by=c("Metabolite", "Component")) %>%
  nest(data = -Component) %>%
  mutate(data = imap(
    data, ~ .x %>%
      arrange(., desc(meanVIP)) %>%
      slice_head(., n = 5) )) %>%
  unnest(data) 

clus <- c(RColorBrewer::brewer.pal(8,"Set2"),
          RColorBrewer::brewer.pal(8,"Paired"))

# Plot with ggplot
top_metabolites %>%
  ggplot(., aes(x = Component, y = VIP, fill = Metabolite)) +
    geom_bar(stat = "identity", position = position_dodge(width = 0.8)) +
    labs(title = "Variable Importance in the Projection (Top 5)",
         x = "Component",
         y = "VIP Score") +
    theme_minimal(base_size = 14) +
    scale_fill_manual( values = clus) +
    theme(
      legend.title = element_blank(),
      plot.title = element_text(face = "bold", hjust = 0.5)
    )

par( mfrow= c(1,2) )
p <- map(c(1,2), ~plotLoadings(final.splsda, comp = .x, 
             method = 'mean', contrib = 'max',
             size.name = 0.8, legend = FALSE, size.title = 1,
             ndisplay = 20, style = "ggplot2",
             title = paste0("Loadings of Component ",.x)) +
              scale_colour_discrete() )

'ndisplay' value is larger than the number of selected variables! It has been reseted to 4 for block X

# determine which OTUs were selected
selected.f.comp1 = selectVar(final.splsda, comp = 1)$name 
selected.f.comp2 = selectVar(final.splsda, comp = 2)$name 

# display the stability values of these OTUs
perf.splsda$features$stable[[1]][selected.f.comp1]


    Myo-Inositol 1-Monophosphate             Glycerol 3-phosphate 
                           1.000                            1.000 
                      Methionine                      Hypotaurine 
                           1.000                            1.000 
                Aminoadipic acid                 Pantothenic acid 
                           1.000                            1.000 
                 Glucuronic acid                    Phenylalanine 
                           1.000                            1.000 
                        Arabitol              Phosphoethanolamine 
                           1.000                            1.000 
                          Uracil                    Ascorbic acid 
                           1.000                            1.000 
                        Cysteine                    Aspartic acid 
                           1.000                            1.000 
                         Ribitol                 Oxoglutaric acid 
                           1.000                            1.000 
            Fructose 6-phosphate                     beta-Alanine 
                           1.000                            1.000 
                      Isoleucine     1,5-Anhydrosorbitol (1,5-AG) 
                           1.000                            1.000 
                   Glyceric acid                          Taurine 
                           1.000                            1.000 
                  Norepinephrine                  scyllo-Inositol 
                           1.000                            1.000 
                          Serine                          Alanine 
                           1.000                            1.000 
                    Fumaric acid                    Glutamic acid 
                           1.000                            1.000 
                          Valine                 Phenylethylamine 
                           1.000                            1.000 
                         Leucine              Glucose 6-phosphate 
                           1.000                            1.000 
     Dehydroascorbic acid (DHAA)                        Threonine 
                           1.000                            1.000 
               Pyroglutamic acid                         Tyramine 
                           1.000                            0.996 
                       Erythrose                    Glutaric acid 
                           1.000                            1.000 
                    myo-Inositol                         Oleamide 
                           1.000                            1.000 
                 Dehydroabietate                    Linoleic acid 
                           1.000                            1.000 
             N-Acetyl-methionine          Indoleacetic acid (IAA) 
                           1.000                            0.868 
  gamma-Aminobutyric acid (GABA)                          Glycine 
                           0.996                            1.000 
                      Putrescine                       Malic acid 
                           0.996                            1.000 
                      Cadaverine                           Fucose 
                           0.992                            0.992 
                          Ribose                   Glycyl-glycine 
                           0.992                            0.964 
                        Fructose                         Lactitol 
                           1.000                            1.000 
         N-Acetylneuraminic Acid                      Lactic acid 
                           0.996                            0.996 
                          Lysine                    Succinic acid 
                           0.960                            0.996 
                         Maltose                    Alpha-Lactose 
                           1.000                            1.000 
                     Maltotriose beta-Hydroxybutyric acid (3-HBA) 
                           1.000                            0.960 
                         Xylitol                    Threonic acid 
                           0.960                            0.976 
                   Hippuric acid                     Abietic acid 
                           1.000                            0.944 
                       Ornithine                       Creatinine 
                           1.000                            0.816 
                     Cholesterol        alpha-Hydroxybutyric acid 
                           0.900                            0.916 
                         Glucose                    Tartaric acid 
                           0.768                            0.952 
                    Elaidic acid     alpha-Hydroxyisovaleric acid 
                           0.760                            0.848 
                        Pectin 2                       Isomaltose 
                           0.788                            0.812 
                  chiro-Inositol                      Citric acid 
                           0.708                            0.544 
                  Pipecolic acid                      Fluconazole 
                           0.444                            0.588

perf.splsda$features$stable[[2]][selected.f.comp2]


Oxoglutaric acid         Fructose     Ethanolamine    Succinic acid 
           0.976            0.976            0.964            0.460

Load Libraries

Overview:

Feature Selection Workflow for Multi-Omics Integration

1. Data QC & Preprocessing

Cleaning up data

Filter samples

Normalize

Get group assignment

Check the distribution of samples per groups

Remove samples with missing Clinical scores

2. Exploratory Supervised Check with PLS-DA

Principal Component Analysis

PLS-DA

Interpretation of PCA vs PLS-DA

3. Tune Sparse Model

Tuning of the model

🔧 tune.splsda()

Number of variables to select

4. Fit Final sPLS-DA Model

Final PLS-DA model

🎯 perf()

🧭 Comparison of tune.splsda() and perf() in mixOmics

5. Assess Model Performance

6. Extract Selected Features

Variable selection and stability

🔧 `tune.splsda()`

🎯 `perf()`

🧭 Comparison of `tune.splsda()` and `perf()` in mixOmics