This is a good ressource to read about PLS-DA https://divingintogeneticsandgenomics.com/post/partial-least-square-regression-for-marker-gene-identification-in-scrnaseq-data/
Overview:
Feature Selection Workflow for Multi-Omics Integration
Clean up data
Compile the data into appropriate format.
Remove QC samples
Filter QC samples.
Normalize
Use quntile normalization from limma package.
Plot unsupervised UMAPs
Plot all metabolomics samples by UMAP across various meta data groups.
Plot shankey plot
Plot the relationship between samples for different meta data groups.
The metabolomic data was provided in a unformatted table, so compiling the data into appropriate format is required. Differences in sample number, metabolite nomenclature, etc.
Code
############## LOAD DATA ###############--- Load sample info ----smpl_IDs <-read_xlsx("../data/svamp_ID_list.xlsx") %>%select(svamp_ID, barcode_metabolomics)#--- Load raw file from metabolomics sequencing facility ----mbol_path <-"../data/Metabolomics/UTF-8SMC_P0800_GCMS_Data.xlsx"RAW <-read_xlsx(mbol_path)############### CLEAN DATA ################ this code is based on Paulos analysis of the metabolomics data# View first few column names and first column valueshead(colnames(RAW), 30)head(RAW[[1]], 30)#--- 1. Identify structure of the table ----# Find where annotation columns end (contains "order" keyword)col_boundary <-which(str_detect(colnames(RAW),regex("order", ignore_case =TRUE)))# Find where sample data begins (first row that starts with "1" in first column)row_boundary <-which(RAW[[1]] =="1")[1] -1#--- 2. Split into three logical components ----# Sample metadata (left block, below header)metadata <- RAW %>%slice((row_boundary +1):n()) %>%select(all_of(1:col_boundary))# Compound annotations (top block, right side)annotations_raw <- RAW %>%slice(1:row_boundary) %>%select(1, all_of((col_boundary +1):ncol(RAW)))#select(1, all_of((col_boundary + 1):ncol(RAW)))# Measurement matrix (sample × metabolite intensities)matrix_data <- RAW %>%slice((row_boundary +1):n()) %>%select(`Compound Name`,"barcode_metabolomics"=`customer sample Identifier`, (col_boundary +1):ncol(RAW) ) %>%left_join(., smpl_IDs, by ="barcode_metabolomics") %>%relocate(svamp_ID, .before ="barcode_metabolomics") %>%mutate(svamp_ID =ifelse(grepl("^QC\\d", .$`Compound Name`), .$`Compound Name`, .$svamp_ID ) ) %>%select(-`Compound Name`, -barcode_metabolomics) %>%pivot_longer(cols =-svamp_ID, names_to ="Compund") %>%pivot_wider(names_from = svamp_ID) %>%select(Compund, starts_with("QC"), sort(colnames(.)))#--- 3. Build tidy annotation table ----# Extract selected annotation rows and pivot them into a tidy formatannotation_table <- annotations_raw %>%#rename(row_label=`GCMS-NA@NA`) %>%slice(., c(2:6)) %>%# rows with HMDB, KEGG, PubChem, Mz, Rtpivot_longer(-1, names_to ="names", values_to ="value") %>%pivot_wider(names_from ="Compound Name", values_from = value) %>%# Create uniqe compound IDs by using M/z and Rtmutate(Compound_ID =paste0("GCMS-",round(as.numeric(`M/z (unique)`)),"@",round(as.numeric(`RI (resolved)`)) ) )#--- 4. Remove quality control compunds ----annotation_table <- annotation_table %>%filter(!grepl("Standard", .$names))# Results:# metadata → sample metadata (e.g. sample ID, run order)# matrix_data → numeric matrix (samples × compounds)# annotation_table → clean annotation metadata per compound############## SAVE DATA ##############write_csv(matrix_data, "../Tidy_data/Metabolomics/Metabolom_matrix.csv")write_csv(metadata, "../Tidy_data/Metabolomics/metabolomics_metadata.csv")write_csv( annotation_table,"../Tidy_data/Metabolomics/metabolomics_annotation.csv")
################ FILTER DATA ################data <- matrix_data %>%# remove reference compundsfilter(!grepl("Standard", .$Compund)) %>%# remove controll samplesselect(-starts_with("QC"))# transforme to matrixmatrix <- data %>%column_to_rownames(var ="Compund") %>%as.matrix(.)
Step 3: Normalize data
# Normalize data################### NORMALIZE DATA #################### samples should be columns and features rowsnorm <-normalizeQuantiles(log1p(matrix))# Transpose for mixOmic analysismatrix.t <- matrix %>%t()matrix.t[1:5, 1:4]