Formatting Rbec output from 2025-01-27 sequenced by BTK Turku using custom HAMBI Illumina v3 primers

Author

Milla Similä & Shane Hogle

Published

May 24, 2025

Abstract
Results are from the remainder of samples/conditions (all trios and quartets with 16, 64 and 256 ug/ml streptomycin) from Milla’s community assembly experiment.

1 Setup

1.1 Libraries

Show/hide code
library(tidyverse)
library(here)
library(fs)
library(archive)
library(scales)
source(here::here("R", "utils_generic.R"))
source(here::here("R", "communities", "amplicon", "utils_amplicon.R"))

1.2 Global variables

Show/hide code
data_raw <- here::here("_data_raw", "communities", "20250127_BTK_illumina_v3")
data <- here::here("data", "communities", "20250127_BTK_illumina_v3")
amplicontar_01 <- here::here(data_raw, "rbec_output_01.tar.gz")
amplicontar_02 <- here::here(data_raw, "rbec_output_02.tar.gz")

# make processed data directory if it doesn't exist
fs::dir_create(data)

# create temporary location to decompress
tmpdir_01 <- fs::file_temp()
tmpdir_02 <- fs::file_temp()

2 Read metadata

Show/hide code
mddf <- readr::read_tsv(here::here(data_raw, "20250127_BTK_illumina_v3_metadata.tsv")) %>% 
  mutate(sample = str_trim(sample))
spdf <- readr::read_tsv(here::here(data_raw, "sample_compositions.tsv")) %>% 
  mutate(sample = str_trim(sample))

3 Read Rbec raw counts tables

3.1 Untar Rbec output tarball

Show/hide code
archive::archive_extract(
  amplicontar_01,
  dir = tmpdir_01,
  files = NULL,
  options = character(),
  strip_components = 0L
)

archive::archive_extract(
  amplicontar_02,
  dir = tmpdir_02,
  files = NULL,
  options = character(),
  strip_components = 0L
)

3.2 Setup directory structure

Show/hide code
tabdir_01 <- here::here(tmpdir_01, "rbec_output_01")
samppaths_01 <- fs::dir_ls(tabdir_01)
sampnames_01 <- path_split(samppaths_01) %>% 
  map_chr(dplyr::last)

tabdir_02 <- here::here(tmpdir_02, "rbec_output_02")
samppaths_02 <- fs::dir_ls(tabdir_02)
sampnames_02 <- path_split(samppaths_02) %>% 
  map_chr(dplyr::last)

3.3 Read

Show/hide code
straintabs_01 <- paste0(samppaths_01, "/strain_table.txt") %>% 
  set_names(sampnames_01) %>% 
  map_df(
  read_tsv,
  skip = 1,
  col_names = c("rRNA16S_locus","count"),
  show_col_types = FALSE, 
  .id = "sample")

straintabs_02 <- paste0(samppaths_02, "/strain_table.txt") %>% 
  set_names(sampnames_02) %>% 
  map_df(
  read_tsv,
  skip = 1,
  col_names = c("rRNA16S_locus","count"),
  show_col_types = FALSE, 
  .id = "sample")

3.4 Clean up

Show/hide code
# remove decompressed directory from temp location
fs::dir_delete(tmpdir_01)
fs::dir_delete(tmpdir_02)

4 Format

Show/hide code
straintabs_norm <- bind_rows(straintabs_01, straintabs_02) %>% 
  # Calls function that normalize counts by 16S copy number
  normalize_by_copy() %>% 
  # Calls function that completes all combinations of 23 species
  # this is important because some species go extinct that should be in the samples
  # and we need to have those in the final table
  complete_combos() %>% 
  # removes any extra whitespace on sample names
  mutate(sample = str_trim(sample))

Make final table

Show/hide code
# Later we will take advantage of the fact that for species not supposed to be
# in a sample the prior left_join will have filled the evo_hist category with an
# NA. We can then filter using this NA value
finaltable <- left_join(straintabs_norm, mddf, by = join_by(sample)) %>% 
  left_join(spdf, by = join_by(sample, strainID)) %>% 
  group_by(sample) %>% 
  dplyr::select(-genus, -species) %>% 
  mutate(f_raw = count_correct/sum(count_correct),
         sample_tot_raw = sum(count_correct)) %>% 
  ungroup() %>% 
  relocate(sample_tot_raw, f_raw, target_f, evo_hist, .after = count_correct)

5 Analysis

Let’s quickly compare and look the at number of reads from the different kinds of experiment categories. Negative controls are samples from the experiment that contained growth medium but no cells. We included them to make sure that there was not contamination between wells during the experiment. Positive controls contain a mixsture of all 23 species (not sure why this was done), and some wells are just empty but were processed/sequenced anyway for convenience of the person preparing the libraries. Experiment samples are those that came from the actual experiment, and masterplate samples are the defined mixtures we made to inoculate the experiment. We sequenced these masterplates so that we would have T0 information for each experimental treatment.

Show/hide code
finaltable %>% 
  dplyr::select(sample, sample_tot_raw, community_type) %>% 
  distinct() %>% 
  ggplot(aes(x = sample_tot_raw)) +
  geom_histogram(bins = 20, aes(fill = community_type)) +
  scale_x_log10() + 
  annotation_logticks(sides = "b", color="grey30") +
  labs(y = "Sample count", x = "Total reads per sample")  +
  facet_wrap(~ community_type, scales="free_y", nrow = 2)

Figure 1: Frequency distribution of total read counts per sample in the experimental samples (red) and the negative controls (blue).

The smallest number of reads in the experimental samples is 565 and the mean is 1.5338^{4} which is considerably more than the smallest number of reads 298 and the mean 1354.9408602 in the negative controls.

Overall this looks pretty good. Not as good as the sequencing batch from 20241024 though, but still decent. From the figure above we can see that there is a clear disjoint between about 3000 reads that separate samples that are negative controls and the experimental samples. However, there appears to be a negative control or two which has ~ 10000 reads and so it might have been contaminated.

5.1 Negative controls

We’ll check whether the experimental and plate negative controls look good. First lets check what species are contaminating negative controls.

Show/hide code
finaltable %>% 
  filter(str_detect(community_type, "^neg")) %>% 
  filter(f_raw > 0) %>%
  # here we'll use the maximum threshold identified in the histogram above. On
  # average we get a little less than 1000 reads for samples that contain no
  # cells/growth. This is just a technical artefact due to PCR steps involved in
  # multiplexing and library preparation
  filter(sample_tot_raw > 3000) %>% 
  contam_histogram(trans=TRUE, f_raw) +
  labs(x = "Frequency distribution of each species in the negative controls", y = "Sample count") +
  facet_wrap(~strainID)
Figure 2: Frequency distribution of all species in the negative controls

Most appear to be low abundance contaminants of species excluded from the experiment. Probably this is cross talk from the positive controls which included the entire mock community.

Show/hide code
finaltable %>% 
  filter(str_detect(community_type, "^neg")) %>% 
  filter(str_detect(strainID, "0403|1287|1896|1977")) %>% 
  filter(f_raw > 0) %>% 
  filter(sample_tot_raw > 3000) %>% 
  contam_histogram(trans=TRUE, f_raw) +
  labs(x = "Frequency distribution of each species in the negative controls", y = "Sample count")

Figure 3: Frequency distribution of only target target species in the negative controls

There seem to be much higher abundances of off-target species in this sequencing batch compared to the last one. This could be legitimate contamination or it could be down to weirdness in the library preparation.

Next we’ll investigate whether the negative control samples have low OD in the endpoint samples. This would indicate growth and that what we are seeing in the histograms is some kind of sequencing artifact and not real contamination.

Show/hide code
od <- read_tsv(here::here("_data_raw", "communities", "optical_density", "20240829_tqs", "optical_density_formatted_012025.txt")) %>% 
  mutate(well = stringr::str_replace(well,
                     pattern = "([0-9]+)", 
                     replacement = {\(x) stringr::str_pad(string = x, width = 2,  side = "left",  pad = "0")}))
Rows: 5761 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): sample, well
dbl (5): transfers, n_species, strep_conc, replicate, OD

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Show/hide code
left_join(finaltable, od, by = join_by(sample, transfers, strep_conc, replicate)) %>% 
  filter(str_detect(community_type, "^neg|^pos")) %>% 
  dplyr::select(sample, OD, well, community_type, transfers, replicate) %>% 
  distinct() %>% 
  filter(!is.na(OD)) %>% 
  ggplot(aes(x = OD)) +
  geom_histogram(bins = 10)

Figure 4: OD distribution of negative controls.

OK, so clearly the vast majority of negative controls have very low OD (OD < 0.1) and it is safe to assume that they are clean. However, there are ~10 that may have growth in them. We will need to flag these samples as likely contaminated.

Show/hide code
df <- left_join(finaltable, od, by = join_by(sample, transfers, strep_conc, replicate)) %>% 
  filter(str_detect(community_type, "^neg|^pos")) %>% 
  dplyr::select(sample, OD, well, community_type, n_species.x, n_species.y, transfers, replicate) %>% 
  distinct() %>% 
  filter(!is.na(OD)) %>% 
  arrange(desc(OD))

df %>% 
  left_join(finaltable, by = join_by(sample, community_type, transfers, replicate, n_species.x==n_species)) %>% 
  filter(f_raw > 0) %>% 
  filter(str_detect(strainID, "0403|1287|1896|1977")) %>% 
  relocate(sample_tot_raw)

This looks OK… Specifically, there are 186 negative control samples but only 8 (4.3 % of negative controls) have any appreciable OD contamination. I think this is quite good and a testament to how clean the experiment has been overall.

5.2 Positive controls

The positive community controls should each have all 23 species.

Show/hide code
finaltable %>% 
  filter(str_detect(community_type, "^pos") & count_correct > 0) %>% 
  group_by(sample) %>%
  mutate(n_sp_detected = n()) %>% 
  distinct(sample, sample_tot_raw, n_sp_detected)

All community positive controls have plenty of reads, have all 23 species detected, and otherwise look normal.

5.3 Misassigned reads

These libraries were only prepared with samples containing HAMBI_0403, HAMBI_1287, HAMBI_1896, and HAMBI_1977 so any time species other than these show up is just an incorrect assignment by Rbec or index leakage/crosstalk from the positive community controls. Let’s check quickly how many of these there are…

Show/hide code
finaltable %>% 
  # filter out positive community controls since they have all hambi species
  filter(!str_detect(community_type, "^pos")) %>% 
  filter(!str_detect(strainID, "0403|1287|1896|1977")) %>% 
  filter(count_correct > 0) %>% 
  arrange(desc(count_correct))

there are only a couple of random reads per experimental sample. Some samples have very high total read counts and it is likely that off-target species leaked into them due to crosstalk and index leakage.

5.4 Samples with few total reads

Some of the experimental pairs had streptomycin concentrations higher than any of the species individually could tolerate. We would expect then that no species would grow successfully in these samples and that the overall biomass would be very low, thus resulting in a low number of recovered reads from these samples.

To look into this. first let’s check which samples have very low OD600 in the endpoint samples. We’ll filter out samples with an OD of less than 0.1 or samples with fewer than 3000 reads. It is generally good practice to exclude samples with low number of reads.

Show/hide code
df <- left_join(finaltable, od, 
                 by = join_by(sample, transfers, strep_conc, replicate)) %>% 
  filter(transfers == 8) %>% 
  filter(OD < 0.1 | sample_tot_raw < 3000) %>% 
  distinct(sample, community_type, OD, sample_tot_raw) 

lowread_samps <- df %>% 
  filter(sample_tot_raw < 3000 & community_type == "experiment") %>% 
  distinct(sample) %>% 
  pull(sample)

df %>% 
  count(community_type)

the low OD samples are all negative or positive controls. However there are 11 samples from the main experiment that failed at the sequencing step (i.e., they had normal optical density)

5.5 Filter to target species

Exclude positive and negative controls and also exclude species that are not in the focal 4

Show/hide code
finaltable_exp_mstr <- finaltable %>% 
  filter(community_type %nin% c("empty", "neg_control", "pos_control")) %>% 
  filter(sample %nin% lowread_samps) %>% 
  filter(strainID %in% c("HAMBI_0403", "HAMBI_1287", "HAMBI_1896", "HAMBI_1977")) %>% 
  group_by(sample) %>% 
  mutate(f_raw_targetsp = count_correct/sum(count_correct),
         sample_tot_targetsp = sum(count_correct)) %>% 
  ungroup() %>% 
  relocate(c(sample_tot_targetsp, f_raw_targetsp), .after = f_raw)

5.6 Experimental samples

We may need to use a higher threshold than 1% (e.g., 3% frequency) because over 3% we can reliably say that a species is present and it is not due to index cross talk.

Show/hide code
finaltable_exp_mstr %>% 
  # there are only experiment samples left
  #filter(str_detect(community_type, "experiment")) %>% 
  filter(is.na(evo_hist)) %>% 
  contam_histogram(trans=TRUE, x = f_raw_targetsp) +
  labs(x = "Species frequency", y = "Sample count") +
  ggtitle("Frequency distribution of each species exceeding the 3% threshold")
Warning in scale_x_continuous(trans = "log10", labels = percent, guide =
guide_axis(angle = 90)): log-10 transformation introduced infinite values.
Warning: Removed 8 rows containing non-finite outside the scale range
(`stat_bin()`).

Figure 5: Frequency distribution of each species in experimental samples where it should not occur and where it exceeds a 3% threshold.
Show/hide code
finaltable_exp_mstr %>% 
  # there are only experiment samples left
  #filter(str_detect(community_type, "experiment")) %>% 
  filter(is.na(evo_hist)) %>% 
  filter(f_raw_targetsp > 0.03 ) %>% 
  dplyr::select(sample, strainID, f_raw_targetsp) %>% 
  arrange(desc(f_raw_targetsp))

There are 63 samples (7% of 949 samples) that exceed a 3% crosstalk threshold. 57 samples (6% of 949 samples) have contamination <= 13% and 6 samples (1 % of 949 samples) have > 13% of a species that should not be there. In most cases the species in highly contaminated samples is HAMBI_1977. I think these 6 samples are probably at very high risk for legitimate contamination. It is likely that the rest of the 57 samples with <= 13% contamination are not contaminated but just crosstalk outliers.

5.6.1 Trios

Here just taking a closer look at the trios from the experiment that appear to be contaminated.

Show/hide code
finaltable_exp_mstr %>% 
  filter(str_detect(community_type, "experiment")) %>% 
  filter(n_species == 3) %>% 
  contaminated_barplot(threshold=0.03, quartet = FALSE, y=f_raw_targetsp) +
  ggtitle("Experiment trios with > 3% contamination")

Figure 6: Species composition of experiment trio samples that contain > 3% of a species that shouldn’t be there (shown in grey to black).
Show/hide code
contamtrios <- finaltable_exp_mstr %>% 
  filter(n_species == 3) %>% 
  filter(is.na(evo_hist)) %>% 
  filter(f_raw_targetsp > 0.13  ) %>% 
  pull(sample)

Only 6 trio samples look really bad (~50 % contamination). Again, I think we should only exclude samples where the contamination is very high (over ~50% of the sample) but those with around 10% or less contaminant I think can be retained, and I will discard the contaminating sequences. Interestingly the lower-level contamination (3% > contamination > 13%) seems to be mostly from from HAMBI_1896. The high level contamination is always in well C01 on 6 different plates (3_p03_C01, 3_p04_C01, 3_p05_C01, 3_p06_C01, 3_p07_C01, 3_p08_C01).

Show/hide code
finaltable_exp_mstr %>% 
  filter(str_detect(community_type, "experiment")) %>% 
  filter(n_species == 3) %>% 
  group_by(sample) %>% 
  mutate(id = cur_group_id()) %>% 
  ungroup() %>% 
  mutate(facet = ntile(id, 5)) %>% 
  contaminated_barplot(threshold = 0, quartet = FALSE, y=f_raw_targetsp) +
  ggtitle("All experimental trios") +
  facet_wrap(~facet, scales="free_x", nrow=5) +
  theme(
    strip.background = element_blank(),
    strip.text.x = element_blank(),
    legend.position = "bottom"
    )

Figure 7: Species composition of all experiment trio samples. Species that should not be present in a sample are shown in grey to black.

But at a birds-eye view this looks pretty good overall. There are a handful (6 samples) that appear to be badly contaminated but otherwise things look good.

5.6.2 Quartets

Again, one weakness of this approach is that there is no way to tell from this data whether a quartet is contaminated because every quartet (by definition) is inoculated with all of the 4 species at different frequencies and with different evolutionary histories. Thus we would expect to see either all 4 species present or a subset of the 4 species in the quartet samples, but we cannot tell if these patterns come from contamination or not. Moving forward we just have to assume that the quartets are OK… However, we’ve found so little contamination in our other experimental samples and all samples were handled in the experiment and processed for sequence library preparation in the same way. I think it is safe to assume that contamination in the quartets is minimal issue.

Show/hide code
finaltable_exp_mstr %>% 
  filter(str_detect(community_type, "experiment")) %>% 
  filter(n_species == 4) %>% 
  group_by(sample) %>% 
  mutate(id = cur_group_id()) %>% 
  ungroup() %>% 
  mutate(facet = ntile(id, 4)) %>% 
  contaminated_barplot(threshold = 0, quartet = TRUE, y=f_raw_targetsp) +
  ggtitle("All experimental quartets") +
  facet_wrap(~facet, scales="free_x", nrow=4) +
  theme(
    strip.background = element_blank(),
    strip.text.x = element_blank(),
    legend.position = "bottom"
    )

Figure 8: Species composition of all experiment quartet samples.

6 Export

Show/hide code
finaltable_exp_mstr_export <- finaltable_exp_mstr %>% 
  # first remove samples with low read count (< 3000 reads)
  filter(sample %nin% lowread_samps) %>% 
  # next remove trio samples that were highly contaminated
  filter(sample %nin% contamtrios) %>% 
  # exclude any remaining counts from species that shouldnt be there using again
  # the fact that evo_hist should be NA for these species
  filter(!is.na(evo_hist)) %>% 
  # because we set 3% as our limit of detection we set read counts of species
  # less than 1% to 0
  mutate(count_correct_thresh = if_else(f_raw_targetsp <= 0.03, 0, count_correct)) %>% 
  # now calculate a new relative abundance based only on the species that should
  # be present and that are > 3% relative abundance
  group_by(sample) %>% 
  mutate(f_thresh = count_correct_thresh/sum(count_correct_thresh)) %>% 
  ungroup() %>%
  # for marking which sequencing batch these came from 
  mutate(batch = "run20250127") %>% 
  dplyr::select(sample, strainID, evo_hist, count_correct_thresh, f_thresh,
                target_f_masterplate = target_f, replicate, strep_conc, 
                transfers, n_species, community_type, plate_well, batch)

6.1 Trios

Write the trio samples

Show/hide code
finaltable_exp_mstr_export %>% 
  filter(n_species == 3) %>% 
  arrange(sample) %>% 
  write_tsv(here::here(data, "trios_counts.tsv"))

6.2 Quartets

Write the quartet samples

Show/hide code
finaltable_exp_mstr_export %>% 
  filter(n_species == 4) %>% 
  arrange(sample) %>% 
  write_tsv(here::here(data, "quartets_counts.tsv"))