2 Prepare Data

This section demonstrates how to prepare per-cluster average normalized expression and modeled expression probability for visualization with scMarco.

If you use Seurat to process your data, cluster average expression can be retrieved with AverageExpression() ¹.

This gives a data.frame in which each row is a gene and each column is a cluster:

# A subset of data retrieved from Ozel et al. (2021)
lognorm <- read.csv(
  "../../example/data/log_norm/Adult.csv",
  row.names = 1,
  check.names = FALSE
)

head(lognorm)

           1 8    9 12 14 15 19 27 31
CG45784 0.00 0 0.00  0  0  0  0  0  0
CG45783 0.00 0 0.00  0  0  0  0  0  0
spok    0.00 0 0.00  0  0  0  0  0  0
Parp    1.00 1 0.99  1  1  1  1  1  1
Alg-2   0.00 0 0.00  0  0  0  0  0  0
Tim17b  0.69 1 1.00  1  1  1  1  1  1

scMarco uses a slightly different organization: a long-format data.frame with 5 columns:

gene: Gene symbols or IDs
cluster: Cluster names or IDs
value: The value of average expression or expression probability
stage: The stage or condition
type: lognorm for average expression; prob for expression probability
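The layout above can be checked before writing anything to a database. The following is a minimal sketch in base R; `check_scmarco_table()` is a hypothetical helper, not part of scMarco, and the toy values are made up for illustration. It assumes that `lognorm` and `prob` are the only values expected in `type`.

```r
# Hypothetical helper (not part of scMarco): checks that a data.frame
# follows the five-column layout described above.
check_scmarco_table <- function(df) {
  required <- c("gene", "cluster", "value", "stage", "type")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (!is.numeric(df$value)) {
    stop("`value` must be numeric")
  }
  # Assumption: only these two data types are expected
  bad_types <- setdiff(unique(df$type), c("lognorm", "prob"))
  if (length(bad_types) > 0) {
    stop("Unexpected value(s) in `type`: ", paste(bad_types, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy table with made-up values, for illustration only
toy <- data.frame(
  gene = c("Parp", "Parp"),
  cluster = c("1", "8"),
  value = c(1, 0.99),
  stage = "Adult",
  type = "lognorm"
)
check_scmarco_table(toy)
```

The function stops with a message at the first problem it finds and returns `TRUE` invisibly otherwise, so it can be called right before a table is written.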

We can convert the above table with tidyr:

library(tidyr)

# Since tidyr does not deal with row names, we need to keep genes as a column
lognorm$gene <- row.names(lognorm)

lognorm_pivot <- pivot_longer(
  lognorm,
  cols = -gene, # Do not pivot genes
  values_to = "value",
  names_to = "cluster"
)

head(lognorm_pivot)

# A tibble: 6 × 3
  gene    cluster value
  <chr>   <chr>   <dbl>
1 CG45784 1           0
2 CG45784 8           0
3 CG45784 9           0
4 CG45784 12          0
5 CG45784 14          0
6 CG45784 15          0

Once the conversion is completed, we annotate the stage (Adult) and data type (lognorm):

lognorm_pivot$stage <- "Adult"
lognorm_pivot$type <- "lognorm"

head(lognorm_pivot)

# A tibble: 6 × 5
  gene    cluster value stage type   
  <chr>   <chr>   <dbl> <chr> <chr>  
1 CG45784 1           0 Adult lognorm
2 CG45784 8           0 Adult lognorm
3 CG45784 9           0 Adult lognorm
4 CG45784 12          0 Adult lognorm
5 CG45784 14          0 Adult lognorm
6 CG45784 15          0 Adult lognorm

To reduce memory usage with large datasets, scMarco stores data in a SQLite database. No worries: this takes only a few extra lines in R.

To interact with SQLite, we need two extra packages: DBI and RSQLite.

library(DBI)
library(RSQLite)

# This creates a new database if the file does not exist yet,
# or connects to the existing one if it does.
db <- dbConnect(SQLite(), "../../example/ex_db.sqlite")

scMarco can handle multiple datasets as long as each one has its own name. Here, we call this dataset example_optic_lobe.

If the table does not exist in the database yet (for example, because the database is new), we can create it with dbWriteTable().

dbWriteTable(
  conn = db, # The database you just opened/connected
  name = "example_optic_lobe",
  value = lognorm_pivot,
  # Regular users do not need the option below.
  # It is turned on so that the documentation can be rebuilt without deleting
  # previously generated examples.
  overwrite = TRUE
)
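If you are unsure whether the table already exists, one option is to check with dbExistsTable() first. The sketch below uses an in-memory database and a made-up one-row table so that it does not touch the real file; `write_or_append()` is a hypothetical helper, not part of scMarco.

```r
library(DBI)

# In-memory stand-in for the real database file
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical helper: create the table on first use, append afterwards
write_or_append <- function(con, name, value) {
  if (dbExistsTable(con, name)) {
    dbAppendTable(con, name, value)
  } else {
    dbWriteTable(con, name, value)
  }
}

# Made-up row, for illustration only
toy <- data.frame(
  gene = "Parp", cluster = "1", value = 1,
  stage = "Adult", type = "lognorm"
)

write_or_append(con, "example_optic_lobe", toy) # creates the table
write_or_append(con, "example_optic_lobe", toy) # appends one more row

n_rows <- nrow(dbReadTable(con, "example_optic_lobe"))
dbDisconnect(con)
```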

Now the average expression data is in the database. A similar process can be followed for expression probability:

# A subset of data retrieved from Ozel et al. (2021)
mm <- read.csv(
  "../../example/data/mixture_model/Adult.csv",
  row.names = 1,
  check.names = FALSE
)

head(mm)

           1 8    9 12 14 15 19 27 31
CG45784 0.00 0 0.00  0  0  0  0  0  0
CG45783 0.00 0 0.00  0  0  0  0  0  0
spok    0.00 0 0.00  0  0  0  0  0  0
Parp    1.00 1 0.99  1  1  1  1  1  1
Alg-2   0.00 0 0.00  0  0  0  0  0  0
Tim17b  0.69 1 1.00  1  1  1  1  1  1

We convert and annotate the probability table as described above:

# Since tidyr does not deal with row names, we need to keep genes as a column
mm$gene <- row.names(mm)

mm_pivot <- pivot_longer(
  mm,
  cols = -gene, # Do not pivot genes
  values_to = "value",
  names_to = "cluster"
)

mm_pivot$stage <- "Adult"
mm_pivot$type <- "prob"

head(mm_pivot)

# A tibble: 6 × 5
  gene    cluster value stage type 
  <chr>   <chr>   <dbl> <chr> <chr>
1 CG45784 1           0 Adult prob 
2 CG45784 8           0 Adult prob 
3 CG45784 9           0 Adult prob 
4 CG45784 12          0 Adult prob 
5 CG45784 14          0 Adult prob 
6 CG45784 15          0 Adult prob

To add expression probability to the existing table (example_optic_lobe) that we just created, we use dbAppendTable() instead of dbWriteTable().

dbAppendTable(
  conn = db, # The database you just opened/connected
  name = "example_optic_lobe",
  value = mm_pivot
)

[1] 108252
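To confirm that both data types landed in the table, one option is to count rows per stage and type with a plain SQL query. The sketch below reproduces the write-then-append sequence on an in-memory database with made-up rows, so the numbers are illustrative only.

```r
library(DBI)

# In-memory database, so the example does not touch the real file
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up values, for illustration only
toy_lognorm <- data.frame(
  gene = c("Parp", "Tim17b"), cluster = "1", value = c(1.2, 0.7),
  stage = "Adult", type = "lognorm"
)
toy_prob <- data.frame(
  gene = c("Parp", "Tim17b"), cluster = "1", value = c(1, 0.69),
  stage = "Adult", type = "prob"
)

dbWriteTable(con, "example_optic_lobe", toy_lognorm)
dbAppendTable(con, "example_optic_lobe", toy_prob)

# Count rows per stage and data type
counts <- dbGetQuery(
  con,
  "SELECT stage, type, COUNT(*) AS n
   FROM example_optic_lobe
   GROUP BY stage, type
   ORDER BY type"
)
print(counts)

dbDisconnect(con)
```

Each stage should show one row count for lognorm and one for prob; a missing combination suggests that a write or append step was skipped.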

Now we have a database ready for scMarco. If you have multiple stages or conditions, repeat the above process for each of them and append all data to the same table.
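For several stages, the conversion and the write can be wrapped in a loop. The following is a minimal sketch under stated assumptions: `to_scmarco_long()` and `make_toy()` are hypothetical helpers, the stage name "P50" and all values are made up, and an in-memory database stands in for the real file.

```r
library(tidyr)
library(DBI)

# Hypothetical helper (not part of scMarco): converts a wide gene-by-cluster
# table to the long five-column layout
to_scmarco_long <- function(wide, stage, type) {
  wide$gene <- row.names(wide)
  long <- pivot_longer(
    wide,
    cols = -gene,
    names_to = "cluster",
    values_to = "value"
  )
  long$stage <- stage
  long$type <- type
  as.data.frame(long)
}

# Made-up tables with two genes and two clusters each
make_toy <- function(values) {
  data.frame(
    `1` = values[1:2], `8` = values[3:4],
    row.names = c("Parp", "Tim17b"), check.names = FALSE
  )
}
stages <- list(
  Adult = make_toy(c(1, 0.7, 1, 1)),
  P50 = make_toy(c(0.9, 0.5, 1, 0.8)) # "P50" is an illustrative stage name
)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
for (stage in names(stages)) {
  long <- to_scmarco_long(stages[[stage]], stage = stage, type = "lognorm")
  # With append = TRUE, the table is created on the first pass and
  # extended on later passes
  dbWriteTable(con, "example_optic_lobe", long, append = TRUE)
}
stored <- dbReadTable(con, "example_optic_lobe")
dbDisconnect(con)
```

Writing one stage at a time keeps only a single long table in memory, which fits the reason for using a database in the first place. The same loop can be repeated with `type = "prob"` for the probability tables.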

Once we are done, we can close the connection to the database with dbDisconnect().

dbDisconnect(db)

¹ Note that you need to set return.seurat = TRUE to get the log-normalized average.