RNAseq ata analysis for the project ‘bacteria-nematode interactions’

Author
Affiliation

Shane L Hogle

Published

July 20, 2025

Keywords

Microbiology, Ecology, Evolution

1 Overview

This work is done in collaboration with Dr. Liu Ting, Dr. Li Gen, and Prof. Zhong Wei at Nanjing Agriculturatl University. Here we quality control the RNAseq data from a metatranscriptomics experiment, map to the Nanjing 122 bacterial species SynCom, and count reads mapping to genes.

In subsequent steps we process the output in R so that we can provide it to our collaborators.

1.1 Format 122 species SynCom

Note this code is not meant to be executed in this project directory but is meant to provide a record of the commands used to format and process the necessary data files from the SynCom assemblies.

  1. Formatting assemblies
Show/hide code
# Get the ids of the genome assemblies
mkdir assemblies
cd assemblies_orig
printf '%s\n' *.fna | sed 's/.fna//g' > ids.txt
mv ids.txt ..
cd ..
mkdir: cannot create directory ‘assemblies’: File exists
bash: line 3: cd: assemblies_orig: No such file or directory
Show/hide code
# Here we rename all chromosomes so that they have a single name, appending an
# integer if there are multiple sequences per genome
cat ids.txt | while read ID; do seqkit replace -p ".*" -r $ID assemblies_orig/$ID.fna | seqkit rename -n | seqkit replace -p "\s.*$" -r "" | seqkit replace -p "_" -r "." > assemblies/$ID.fna; done
cat: ids.txt: No such file or directory
Show/hide code
# Now we find assemblies with multiple chromosomes for some later manual tweaks.
# This is needed to have a consistent naming scheme
grep -c ">" assemblies/* | sed 's/\:/ /g' | sort -nrk2,2
grep: assemblies/*: No such file or directory
  1. Prokka

Run prokka to call genes on the 122 assemblies and annotate them with a consistent format - see sh/prokka/prokka.sh

  1. Eggnoggmapper

Run eggnoggmapper to functionally annotate genes - see sh/eggnoggmapper/eggnoggmapper.sh

  1. Formatting for rRNA filtering (these steps performed on computer cluster)

Get rRNAs from prokka which will be used later in the quality control steps to filter out reads mapping to bacterial rRNAs

cd prokka
grep "ribosomal RNA" */*.tsv | sed 's/.tsv:/.ffn\t/g' | cut -f1,2 | while read FILE RNA; do seqkit grep -n -r -p "${RNA}" ${FILE}; done > nanjingSynCom122_rrnas.fasta
seqkit rmdup -s nanjingSynCom122_rrnas.fasta

1.2 Preparing for read mapping and feature counting

  1. Make SAF file from GFF file to be used with featureCounts
cd rnaseq/map_count/saf
cat ../../../nanjingSynCom122/ids.txt | while read ID; do sed -n '/>/q;p' ../../../nanjingSynCom122/prokka/${ID}/${ID}.gff | grep -v "^#" | sed 's/;.*$//g' | sed 's/ID=//g' | bioawk -t '{print $9, $1, $4, $5, $7}' | grep -v "^note"; done > nanjingSynCom122_combined.saf

The resulting file looks like this

head saf/nanjingSynCom122_combined.saf
GeneID  Chr Start   End Strand
1aci1_00001 1aci1   178 1365    +
1aci1_00002 1aci1   1498    2616    +
1aci1_00003 1aci1   2754    5378    +
1aci1_00004 1aci1   5603    5920    -
1aci1_00005 1aci1   6257    6868    -
1aci1_00006 1aci1   7122    7913    +
1aci1_00007 1aci1   7967    8389    +
1aci1_00008 1aci1   8463    8807    -
1aci1_00009 1aci1   8986    9246    +

Concatenate all assemblies into single file for mapping with bwa. Create the necessary indexes

cd rnaseq/map_count/refs
cat ../../../nanjingSynCom122/prokka/*/*.fna > nanjingSynCom122_combined.fna
samtools faidx nanjingSynCom122_combined.fna
bwa index nanjingSynCom122_combined.fna

1.3 Quality control RNA-seq data

Here we do standard filtering and trimming of reads.Bacterial community total cDNA read pairs are processed using BBDuk (version 39.23 (https://sourceforge.net/projects/bbmap/) to remove contaminants, trim reads that contained adapter sequence, right quality trim reads where quality drops below Q10, and exclude reads with more than 2 Ns. BBMap (version 39.23) was then used map cDNA read pairs to a database of bactieral 16S, 5S and 18S rRNAs from the 122 bacterial species SynCom based on exact matches.

The steps to reproduce this are available in sh/qualitycontrol/

1.4 Map RNA-seq data

Here we map the quality controlled read pairs to a concatenated reference of the 122 bacterial SynCom genomes using BWA. The resulting BAM files are sorted and then the program featureCounts v2.1.1 is used to count the number of reads mapping to genes in the assemblies.

The steps to reproduce this are available in sh/map_count/

1.5 Process data for differential expression

  1. Load big matrix into R and use gene identifiers to partition into n <= 122 species-specific sub-matrices

1.6 Compare Samples via Ordination

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02337-8

  1. Calculate TMM for each species sub-matrix separately.
  • https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-3-r25
  • https://www.biostars.org/p/9475236/
  • https://www.biostars.org/p/306010/
  • https://www.biostars.org/p/419018/#419177
  • https://davetang.github.io/muse/edger.html
  1. Combine TMM matrices for all species into a single matrix and use in a clustering analysis like tSNE (reproduces Fig. 3d)

For the comparison of transcript abundance across multiple samples, transcript counts need to be normalized in each organism by individual transcript lengths and total library size. Due to the differences in transcript abundances between the major and minor organisms, counts for each organism should be normalized independently using transcript per million (TPM) calculations [73]. TPM values are calculated by dividing all read counts by the length of each gene in kilobases to obtain a reads per kilobase (RPK) value for each gene [92]. The RPK value for each gene is then divided by the sum of RPK values divided by 1,000,000. While RPKM and FPKM calculations are also used for normalization, the sum of the RPKM and FPKM values differ between samples with differing numbers of reads, which can result in disproportionate comparisons [92].

1.7 Find differentially expressed genes

  1. Run a separate edgeR analysis for each species sub-matrix
  2. Concatenate results of significantly differentially expressed genes
  3. Send to collaborators

1.8 Availability

Data and code in this GitHub repository (https://github.com/GHUSERNAME/REPOSITORYNAME) are provided under GNU AGPL3. The rendered project site is available at https://GHUSERNAME.github.io/REPOSITORYNAME/, which has been produced using Quarto notebooks. The content on the rendered site is released under the CC BY 4.0. This repository hosts all code and data for this project, including the code necessary to fully recreate the rendered webpage.

An archived release of the code is available from Zenodo: https://zenodo.org/records/EVENTUAL_ZENODO_RECORD

Raw sequencing data used in the project is available from NCBI Bioproject [PRJNA00000000](https://www.ncbi.nlm.nih

1.9 Reproducibility

The project uses renv to create a reproducible environment to execute the code in this project. See here for a brief overview on collaboration and reproduction of the entire project. To get up and running from an established repository, you could do:

install.packages("renv")
renv::restore()

To initiate renv for a new project:

install.packages("renv")
# initialize
renv::init()
# install some new packages
renv::install("tidyverse")
# record those packages in the lockfile
renv::snapshot()