Formatting Rbec output from 2024-10-20 sequenced by BTK Turku using custom HAMBI Illumina v3 primers

Author

Milla Similä & Shane Hogle

Published

May 24, 2025

Abstract
Results from trios and quartets in 0 ug/ml streptomycin concentration and their masterplates from Milla’s community assembly experiment. Also includes a couple pairs that failed last time that we redid in this sequencing run.

1 Setup

1.1 Libraries

Show/hide code
library(tidyverse)
library(here)
library(fs)
library(archive)
library(scales)
source(here::here("R", "utils_generic.R"))
source(here::here("R", "communities", "amplicon", "utils_amplicon.R"))

1.2 Global variables

Show/hide code
data_raw <- here::here("_data_raw", "communities", "20241024_BTK_illumina_v3")
data <- here::here("data", "communities", "20241024_BTK_illumina_v3")
amplicontar <- here::here(data_raw, "rbec_output.tar.gz")

# make processed data directory if it doesn't exist
fs::dir_create(data)

# create temporary location to decompress
tmpdir <- fs::file_temp()

2 Read metadata

Show/hide code
mddf <- readr::read_tsv(here::here(data_raw, "20241024_BTK_illumina_v3_metadata.tsv")) %>% 
  mutate(sample = str_trim(sample))
spdf <- readr::read_tsv(here::here(data_raw, "sample_compositions.tsv")) %>% 
  mutate(sample = str_trim(sample))

3 Read Rbec raw counts tables

3.1 Untar Rbec output tarball

Show/hide code
archive::archive_extract(
  amplicontar,
  dir = tmpdir,
  files = NULL,
  options = character(),
  strip_components = 0L
)

3.2 Setup directory structure

Show/hide code
tabdir <- here::here(tmpdir, "rbec_output")
samppaths <- fs::dir_ls(tabdir)
sampnames <- path_split(samppaths) %>% 
  map_chr(dplyr::last)

3.3 Read

Show/hide code
straintabs <- paste0(samppaths, "/strain_table.txt") %>% 
  set_names(sampnames) %>% 
  map_df(
  read_tsv,
  skip = 1,
  col_names = c("rRNA16S_locus","count"),
  show_col_types = FALSE, 
  .id = "sample")

3.4 Clean up

Show/hide code
# remove decompressed directory from temp location
fs::dir_delete(tmpdir)

4 Format

Show/hide code
straintabs_norm <- straintabs %>% 
  # Calls function that normalize counts by 16S copy number
  normalize_by_copy() %>% 
  # Calls function that completes all combinations of 23 species
  # this is important because some species go extinct that should be in the samples
  # and we need to have those in the final table
  complete_combos() %>% 
  # removes any extra whitespace on sample names
  mutate(sample = str_trim(sample))

Make final table

Show/hide code
# Later we will take advantage of the fact that for species not supposed to be
# in a sample the prior left_join will have filled the evo_hist category with an
# NA. We can then filter using this NA value
finaltable <- left_join(straintabs_norm, mddf, by = join_by(sample)) %>% 
  left_join(spdf, by = join_by(sample, strainID)) %>% 
  group_by(sample) %>% 
  dplyr::select(-genus, -species) %>% 
  mutate(f_raw = count_correct/sum(count_correct),
         sample_tot_raw = sum(count_correct)) %>% 
  ungroup() %>% 
  relocate(sample_tot_raw, f_raw, target_f, evo_hist, .after = count_correct) %>% 
  # reformats sample names for pairs so that they are same as in the the batch from
  # 20240711
  mutate(sample = case_when(n_species == 2 ~ notes, 
                            TRUE ~ sample))

5 Analysis

Let’s quickly compare and look the at number of reads from the different kinds of experiment categories. Negative controls are samples from the experiment that contained growth medium but no cells. We included them to make sure that there was not contamination between wells during the experiment. Positive controls contain a mixsture of all 23 species (not sure why this was done), and some wells are just empty but were processed/sequenced anyway for convenience of the person preparing the libraries. Experiment samples are those that came from the actual experiment, and masterplate samples are the defined mixtures we made to inoculate the experiment. We sequenced these masterplates so that we would have T0 information for each experimental treatment.

Show/hide code
finaltable %>% 
  dplyr::select(sample, sample_tot_raw, community_type) %>% 
  distinct() %>% 
  ggplot(aes(x = sample_tot_raw)) +
  geom_histogram(bins = 20, aes(fill = community_type)) +
  scale_x_log10() + 
  annotation_logticks(sides = "b", color="grey30") +
  labs(y = "Sample count", x = "Total reads per sample") +
  facet_wrap(~ community_type, scales="free_y", nrow = 2)

Figure 1: Frequency distribution of total read counts per sample in the experimental samples (red) and the negative controls (blue).

The smallest number of reads in the experimental samples is 8295 and the mean is 17337.13 which is considerably more than the smallest number of reads 452 and the mean 888.68 in the negative controls.

This looks really good. The number of reads in the negative controls are always very low (similar to “empty” samples and always an order of magnitude less than masterplates and experimental samples). The number of reads in the negative controls will depend on the sequencing depth used and in this batch compared to the first one with pairs we have used a higher sequencing depth (or less multiplexing) so that the average number of reads is higher here.

5.1 Negative controls

First, we’ll check whether the experimental and plate negative controls look good

Lets check what species are contaminating negative controls. Most appear to be low abundance contaminants of species excluded from the experiment. Probably this is cross talk from the positive controls and other samples

Show/hide code
finaltable %>% 
  filter(str_detect(community_type, "^neg")) %>% 
  filter(f_raw > 0) %>%
  contam_histogram(trans=TRUE, x = f_raw) +
  labs(x = "Frequency distribution of each species in the negative controls", y = "Sample count") +
  facet_wrap(~strainID)

Figure 2: Frequency distribution of all species in the negative controls
Show/hide code
finaltable %>% 
  filter(str_detect(community_type, "^neg")) %>% 
  filter(str_detect(strainID, "0403|1287|1896|1977")) %>% 
  filter(f_raw > 0) %>% 
  contam_histogram(trans=TRUE, x = f_raw) +
  labs(x = "Frequency distribution of each species in the negative controls", y = "Sample count")

Figure 3: Frequency distribution of only target target species in the negative controls
Show/hide code
od <- read_tsv(here::here("_data_raw", "communities", "optical_density", "20240829_tqs", "optical_density_formatted_102024.txt")) %>% 
  mutate(well = stringr::str_replace(well,
                     pattern = "([0-9]+)", 
                     replacement = {\(x) stringr::str_pad(string = x, width = 2,  side = "left",  pad = "0")}))

All negative control have very low OD < 0.1 and it is safe to assume that they are clean.

Show/hide code
df <- left_join(finaltable, od, by = join_by(sample, transfers==transfer, replicate, strep_conc)) %>% 
  filter(str_detect(community_type, "^neg")) %>% 
  dplyr::select(sample, OD, well, community_type, n_species.x, n_species.y, replicate) %>% 
  distinct() %>% 
  filter(!is.na(OD))

All negative control have very low OD < 0.1 and low read counts it is safe to assume that they are clean.

Show/hide code
df %>% 
  left_join(finaltable, by = join_by(sample, community_type, replicate, n_species.x==n_species)) %>% 
  filter(f_raw > 0) %>% 
  filter(str_detect(strainID, "0403|1287|1896|1977")) %>% 
  dplyr::select(sample_tot_raw, sample, OD, community_type) %>% 
  distinct() %>% 
  pivot_longer(c(-sample, -community_type)) %>% 
  ggplot(aes(x = value)) +
  geom_histogram(bins = 5) +
  facet_grid(~name, scales = "free_x")

Figure 4: Optical density (OD) and total read count distributions of negative controls.

5.2 Positive controls

The positive community controls should each have all 23 species.

Show/hide code
finaltable %>% 
  filter(str_detect(community_type, "^pos") & count_correct > 0) %>% 
  group_by(sample) %>%
  mutate(n_sp_detected = n()) %>% 
  distinct(sample, sample_tot_raw, n_sp_detected)

2/3 community controls have plenty of reads and look normal. One has under 1000 reads in total and it only contains the 4 focal species (0403, 1287, 1896, 1977). It looks to me that this could be a mislabeled negative control?

5.3 Misassigned reads

These libraries were only prepared with samples containing HAMBI_0403, HAMBI_1287, HAMBI_1896, and HAMBI_1977 so any time species other than these show up is just an incorrect assignment by Rbec or index leakage/crosstalk from the positive community controls. Let’s check quickly how many of these there are…

Show/hide code
finaltable %>% 
  # filter out community controls since they have all hambi species
  filter(!(sample %in% c("4_p09_A01", "4_p10_A01", "4m_p18_A01" ))) %>% 
  filter(!str_detect(strainID, "0403|1287|1896|1977")) %>% 
  filter(count_correct > 0) %>% 
  arrange(desc(count_correct))

there are only very small number of reads mapping to off-target species per experimental sample probably from crosstalk leakage.

5.4 Samples with few total reads

Some of the experimental pairs had streptomycin concentrations higher than any of the species individually could tolerate. We would expect then that no species would grow successfully in these samples and that the overall biomass would be very low, thus resulting in a low number of recovered reads from these samples.

To look into this. first let’s check which samples have very low OD600 in the endpoint samples. We’ll filter out samples with an OD of less than 0.1 and also samples with fewer than 1000 reads. It is generally good practice to exclude samples with low number of reads.

Show/hide code
left_join(finaltable, od, 
          by = join_by(sample, transfers==transfer, replicate, strep_conc)) %>% 
  filter(transfers == 8) %>% 
  filter(OD < 0.1 | sample_tot_raw < 3000) %>% 
  distinct(sample, community_type, OD, sample_tot_raw) %>% 
  count(community_type)

Samples with low reads or low OD are all negative controls or the one positive control that failed (plus another positive control 4_p10_A01 but this must be from mislabeling because there were no positive controls actually on the experimental plates).

5.5 Filter to target species

Exclude positive and negative controls and also exclude species that are not in the focal 4

Show/hide code
finaltable_exp_mstr <- finaltable %>% 
  filter(community_type %nin% c("empty", "neg_control", "pos_control")) %>% 
  # no low-read samples to filter
  #filter(sample %nin% lowread_samps) %>% 
  filter(strainID %in% c("HAMBI_0403", "HAMBI_1287", "HAMBI_1896", "HAMBI_1977")) %>% 
  group_by(sample) %>% 
  mutate(f_raw_targetsp = count_correct/sum(count_correct),
         sample_tot_targetsp = sum(count_correct)) %>% 
  ungroup() %>% 
  relocate(c(sample_tot_targetsp, f_raw_targetsp), .after = f_raw)

5.6 Masterplate samples

5.6.1 Pairs

Something a bit weird is going on with two of the masterplate pairs. Samples P19 and P20 both somehow have ~20 of HAMBI_0403 present in them. However, the rest of the species composition looks like how we expect them. I think we will keep these two masterplate samples and just filter out the HAMBI_0403 that shouldn’t be there

Show/hide code
finaltable_exp_mstr %>%
  filter(str_detect(community_type, "masterplate") & n_species == 2)  %>% 
  dplyr::select(sample, strainID, f_raw_targetsp, target_f, evo_hist) %>% 
  mutate(f_raw_targetsp = round(f_raw_targetsp, 2)) %>% 
  distinct()
Show/hide code
finaltable_exp_mstr %>%
  filter(str_detect(community_type, "masterplate") & n_species == 2) %>% 
  contaminated_barplot(threshold = 0, quartet = FALSE, y=f_raw_targetsp) +
  ggtitle("Masterplate pairs with contamination")

Figure 5: Species composition of masterplate pairs that contain any percentage (> 0%) of a species that shouldn’t be there (shown in grey to black).

5.6.2 Trios

Trio masterplates look quite good. Yes, there are some samples where a species that shouldn’t be there is > 3%, but these are always < 5% and there are no samples with major contamination

Show/hide code
finaltable_exp_mstr %>%
  filter(str_detect(notes, "trios") & str_detect(notes, "masterplate")) %>% 
  contaminated_barplot(threshold = 0.01, quartet = FALSE, y=f_raw_targetsp) +
  ggtitle("Masterplate trios with contamination")

Figure 6: Species composition of masterplate pairs that contain >= 1% of a species that shouldn’t be there (shown in grey to black).
Show/hide code
finaltable_exp_mstr %>%
  filter(str_detect(notes, "trios") & str_detect(notes, "masterplate")) %>% 
  filter(is.na(evo_hist)) %>% 
  filter(f_raw > 0) %>% 
  contam_histogram(trans = FALSE, x = f_raw_targetsp) +
  labs(x = "Species frequency", y = "Sample count") +
  ggtitle("Frequency distribution of misassigned species in masterplate trios")

Figure 7: Frequency distribution of each species in masterplate samples where it should not occur. The 1% mark (commonly accepted as an acceptable Illumina cross-talk standard) is demarcated with a dashed line.
Show/hide code
finaltable_exp_mstr %>%
  filter(str_detect(notes, "trios") & str_detect(notes, "masterplate")) %>% 
  filter(!is.na(evo_hist)) %>% 
  filter(f_raw > 0) %>% 
  contam_histogram(trans = FALSE, x = f_raw_targetsp) +
  labs(x = "Species frequency", y = "Sample count") +
  ggtitle("Frequency distribution of focal species in masterplate trios")

Figure 8: Frequency distribution of each species in all samples where it should occur.

5.6.3 Quartets

One weakness of this contamination-detection approach is that there is no way to tell from this data whether a quartet masterplate is contaminated because every quartet (by definition) is inoculated with all of the 4 species at different frequencies and with different evolutionary histories. See Section 5.7.3 below for more information. However, we can still visualize the composition of those quartets and whether they seem to be close to what we expected to put in.

Show/hide code
finaltable_exp_mstr %>%
  filter(str_detect(notes, "quartets") & str_detect(notes, "masterplate")) %>% 
  contaminated_barplot(threshold = 0.01, quartet = TRUE, y = f_raw_targetsp) +
  ggtitle("All masterplate quartets")

Figure 9: Species composition of masterplate quartets.

These look really good and are clearly following the correct pattern/distribution for the quartets.

5.7 Experimental samples

We may need to use a higher threshold than 1% (e.g., 3% frequency) because over 3% we can reliably say that a species is present and it is not due to index cross talk.

Show/hide code
finaltable_exp_mstr %>% 
  filter(str_detect(community_type, "experiment")) %>% 
  filter(is.na(evo_hist)) %>% 
  filter(f_raw > 0.03 ) %>% 
  contam_histogram(trans=TRUE, x = f_raw_targetsp) +
  labs(x = "Species frequency", y = "Sample count") +
  ggtitle("Frequency distribution of each species exceeding the 3% threshold")

Figure 10: Frequency distribution of each species in experimental samples where it should not occur and where it exceeds a 3% threshold.
Show/hide code
finaltable_exp_mstr %>% 
  # there are only experiment samples left
  filter(str_detect(community_type, "experiment")) %>% 
  filter(is.na(evo_hist)) %>% 
  filter(f_raw_targetsp > 0.03 ) %>% 
  dplyr::select(sample, strainID, f_raw_targetsp) %>% 
  arrange(desc(f_raw_targetsp))

There are 11 samples (2% of 498 samples) that exceed a 3% crosstalk threshold. 9 samples (2% of 498 samples) have contamination <= 8% and 2 samples (0 % of 498 samples) have > 8% of a species that should not be there. In most cases the species in highly contaminated samples is HAMBI_1977. I think these 2 samples are probably at very high risk for legitimate contamination. It is likely that the rest of the 9 samples with <= 8% contamination are not contaminated but just crosstalk outliers.

5.7.1 Pairs

Here just taking a closer look at the pairs from the experiment that appear to be contaminated.

Show/hide code
finaltable_exp_mstr %>% 
  filter(str_detect(community_type, "experiment") & n_species == 2) %>% 
  contaminated_barplot(threshold = 0.03, quartet = FALSE, y=f_raw_targetsp) +
  ggtitle("Experiment pairs with > 3% contamination")

Figure 11: Species composition of pairs masterplates samples that contain > 3% of a species that shouldn’t be there (shown in grey to black).

All experimental pair samples look pretty good. I think we should only exclude samples where the contamination is very high (over ~50% of the sample) but those with around 10% or less contaminant I think can be retained. So we will not need filter any of these out.

5.7.2 Trios

Here just taking a closer look at the trios from the experiment that appear to be contaminated.

Show/hide code
finaltable_exp_mstr %>% 
  filter(str_detect(community_type, "experiment") & n_species == 3) %>% 
  contaminated_barplot(threshold=0.03, quartet = FALSE, y=f_raw_targetsp) +
  ggtitle("Experiment trios with > 3% contamination")

Figure 12: Species composition of experiment trio samples that contain > 3% of a species that shouldn’t be there (shown in grey to black).

Only 2 trio samples look really bad (over 10 %, more like 50 % contamination). Again, I think we should only exclude samples where the contamination is very high (over ~50% of the sample) but those with around 10% or less contaminant I think can be retained, and I will discard the contaminating sequences.

Show/hide code
contamtrios <- finaltable_exp_mstr %>% 
  filter(str_detect(community_type, "experiment") & n_species == 3) %>% 
  filter(is.na(evo_hist)) %>% 
  filter(f_raw_targetsp > 0.13  ) %>% 
  pull(sample)

5.7.3 Quartets

Again, one weakness of this approach is that there is no way to tell from this data whether a quartet is contaminated because every quartet (by definition) is inoculated with all of the 4 species at different frequencies and with different evolutionary histories. Thus we would expect to see either all 4 species present or a subset of the 4 species in the quartet samples, but we cannot tell if these patterns come from contamination or not. Moving forward we just have to assume that the quartets are OK… However, we’ve found so little contamination in our other experimental samples and all samples were handled in the experiment and processed for sequence library preparation in the same way. I think it is safe to assume that contamination in the quartets is minimal issue.

Show/hide code
finaltable_exp_mstr %>% 
  filter(str_detect(community_type, "experiment")) %>% 
  filter(n_species == 4) %>% 
  group_by(sample) %>% 
  mutate(id = cur_group_id()) %>% 
  ungroup() %>% 
  mutate(facet = ntile(id, 4)) %>% 
  contaminated_barplot(threshold = 0, quartet = TRUE, y=f_raw_targetsp) +
  ggtitle("All experimental quartets") +
  facet_wrap(~facet, scales="free_x", nrow=4) +
  theme(
    strip.background = element_blank(),
    strip.text.x = element_blank(),
    legend.position = "bottom"
    )

Figure 13: Species composition of masterplate quartets.

6 Export

Show/hide code
finaltable_exp_mstr_export <- finaltable_exp_mstr %>% 
  # first remove samples with low read count (< 3000 reads)
  # filter(sample %nin% lowread_samps) %>% 
  # next remove masterplate pair samples that were highly contaminated
  # filter(sample %nin% contampairsmaster) %>% 
  # remove trio samples that were highly contaminated
  filter(sample %nin% contamtrios) %>% 
  # exclude any remaining counts from species that shouldnt be there using again
  # the fact that evo_hist should be NA for these species
  filter(!is.na(evo_hist)) %>% 
  # because we set 3% as our limit of detection we set read counts of species
  # less than 1% to 0
  mutate(count_correct_thresh = if_else(f_raw_targetsp <= 0.03, 0, count_correct)) %>% 
  # now calculate a new relative abundance based only on the species that should
  # be present and that are > 3% relative abundance
  group_by(sample) %>% 
  mutate(f_thresh = count_correct_thresh/sum(count_correct_thresh)) %>% 
  ungroup() %>%
  # for marking which sequencing batch these came from 
  mutate(batch = "run202410124") %>% 
  dplyr::select(sample, strainID, evo_hist, count_correct_thresh, f_thresh,
                target_f_masterplate = target_f, replicate, strep_conc, 
                transfers, n_species, community_type, plate_well, batch)

6.1 Pairs

Write the pair samples

Show/hide code
finaltable_exp_mstr_export %>% 
  filter(n_species == 2) %>% 
  arrange(sample) %>% 
  write_tsv(here::here(data, "pairs_counts.tsv"))

6.2 Trios

Write the trio samples

Show/hide code
finaltable_exp_mstr_export %>% 
  filter(n_species == 3) %>% 
  arrange(sample) %>% 
  write_tsv(here::here(data, "trios_counts.tsv"))

6.3 Quartets

Write the quartet samples

Show/hide code
finaltable_exp_mstr_export %>% 
  filter(n_species == 4) %>% 
  arrange(sample) %>% 
  write_tsv(here::here(data, "quartets_counts.tsv"))