Deep Learning with R: Recurrent Neural Networks and TensorFlow

Author

Martin Schweinberger

Published

2026

Introduction

This tutorial introduces deep learning with R, with a focus on Recurrent Neural Networks (RNNs) and their most successful variant, Long Short-Term Memory (LSTM) networks. These architectures are specifically designed to process sequential data — making them especially well suited to tasks in language and text analysis, such as sentiment classification, authorship attribution, text generation, and sequence labelling.

Deep learning has transformed natural language processing over the past decade (Goodfellow, Bengio, and Courville 2016). Unlike classical machine learning methods that rely on hand-crafted features, deep neural networks learn hierarchical representations directly from raw data. For text, this means a model can learn from characters, words, or sentences without requiring manually designed linguistic features. The challenge with language, however, is that meaning is inherently sequential: the word not reverses the polarity of everything that follows it, and a pronoun several sentences earlier may resolve the identity of a noun that comes later. Standard feedforward networks have no mechanism for this kind of temporal reasoning. RNNs and LSTMs were designed precisely to address it.

This tutorial is aimed at intermediate users of R who are comfortable with basic programming and have some familiarity with statistical modelling. No prior knowledge of deep learning or neural networks is assumed. By the end of this tutorial you will be able to:

Explain the core concepts of neural networks, RNNs, LSTMs, and GRUs
Understand the vanishing gradient problem and why it motivates LSTM
Install and configure TensorFlow and Keras in R
Prepare text data for sequence modelling
Build, train, and evaluate an LSTM model for text classification
Apply regularisation techniques to prevent overfitting
Interpret model output and diagnose training problems

Prerequisite Knowledge

Before working through this tutorial, you should be familiar with:

Basic R syntax and data manipulation (Getting Started with R)
Loading and transforming data in R (Loading and Saving Data in R)
Basic concepts in statistical modelling (Descriptive Statistics)

It also helps to have read the Introduction to Text Analysis tutorial, as some concepts (tokenisation, corpora, frequency) are assumed knowledge here.

Citation

Martin Schweinberger. 2026. Deep Learning with R: Recurrent Neural Networks and TensorFlow. The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia. url: https://ladal.edu.au/tutorials/deeplearning_tutorial/deeplearning_tutorial.html (Version 2026.04.01), doi: 10.5281/zenodo.19362594.

Setup

Installing Packages

This tutorial requires the keras3 package (previously keras), which provides an R interface to Keras and TensorFlow. TensorFlow itself is installed as a Python library; the keras3 package handles this automatically.

Code

# Run once — comment out after installation
install.packages("keras3")
install.packages("tidyverse")
install.packages("ggplot2")
install.packages("glue")

# Install TensorFlow backend (run once after keras3 is installed)
keras3::install_keras(backend = "tensorflow")

Python Dependency

TensorFlow requires Python 3.8–3.11 to be installed on your system. When you run install_keras(), it creates an isolated Miniconda environment named r-keras and installs TensorFlow inside it. You do not need to manage Python manually. If you encounter errors, run keras3::install_keras(envname = "r-reticulate") to use your existing reticulate environment.

Loading Packages

Code

library(keras3)
library(tidyverse)
library(ggplot2)
library(glue)
klippy::klippy()

What Is Deep Learning?

Section Overview

What you will learn: The core ideas behind neural networks; how deep learning differs from classical machine learning; and why it is useful for language data.

Deep learning refers to a family of machine learning methods based on artificial neural networks with multiple layers of processing units. The term deep refers to the depth of these networks — the number of successive transformations that input data passes through before producing an output (Goodfellow, Bengio, and Courville 2016).

The fundamental unit of a neural network is the neuron (or node): a mathematical function that takes a weighted sum of its inputs, adds a bias term, and passes the result through a non-linear activation function. Formally, for a neuron \(j\) with inputs \(x_1, \ldots, x_n\):

\[a_j = f\!\left(\sum_{i=1}^{n} w_{ij} x_i + b_j\right)\]

where \(w_{ij}\) are the learned weights, \(b_j\) is the bias, and \(f\) is an activation function such as the rectified linear unit (ReLU): \(f(x) = \max(0, x)\).

Neurons are arranged in layers. A standard feedforward network has:

An input layer that receives the raw data
One or more hidden layers that transform the representation
An output layer that produces the final prediction

The network learns by adjusting its weights to minimise a loss function (a measure of prediction error) using an algorithm called backpropagation, which propagates error gradients from the output back through the network (Rumelhart, Hinton, and Williams 1986). The weights are updated iteratively using gradient descent, moving slightly in the direction that reduces the loss at each step.

Why Deep Learning for Text?

Classical text analysis methods — bag-of-words, TF-IDF, topic models — treat documents as unordered collections of tokens. Deep learning methods, by contrast, can capture:

Word order: the sequence dog bites man differs from man bites dog
Long-range dependencies: a negation early in a sentence may affect a sentiment word several tokens later
Compositional meaning: the meaning of not bad is not the sum of not and bad
Sub-word patterns: morphological regularities can be captured at the character level

These properties make deep learning particularly powerful for sequence tasks in NLP.

Recurrent Neural Networks

Section Overview

What you will learn: How RNNs process sequences; what a hidden state is; and why simple RNNs struggle with long-range dependencies.

The Core Idea

A Recurrent Neural Network (RNN) is a neural network designed to process sequential data by maintaining a hidden state — a vector that summarises everything the network has seen up to the current time step. At each step \(t\), the RNN receives an input \(x_t\) and updates its hidden state \(h_t\) based on both the current input and the previous hidden state:

\[h_t = f\!\left(W_h h_{t-1} + W_x x_t + b\right)\]

where \(W_h\) and \(W_x\) are weight matrices and \(f\) is typically the hyperbolic tangent (\(\tanh\)). The hidden state acts as a form of memory: it encodes information about previous inputs that is relevant to the current prediction.

This design was formalised in the Simple Recurrent Network (SRN) of Elman (1990), one of the earliest systematic studies of what information RNNs actually retain in their hidden states across time. Elman’s experiments with grammatical structure showed that RNNs could, in principle, learn implicit syntactic representations — a finding that foreshadowed modern neural language modelling.

Unrolling Through Time

To understand how RNNs are trained, it helps to visualise the network unrolled through time. An RNN processing a sequence of length \(T\) is equivalent to a deep feedforward network with \(T\) layers, where each layer shares the same weights. Training proceeds via Backpropagation Through Time (BPTT): error gradients are propagated backwards through the unrolled network, updating the shared weights at each time step.

The Vanishing Gradient Problem

BPTT requires multiplying gradients across every time step. Because the same weight matrix \(W_h\) is applied repeatedly, gradients tend to either vanish (shrink exponentially, making early time steps unreachable) or explode (grow exponentially, causing instability).

Bengio, Simard, and Frasconi (1994) provided a rigorous mathematical analysis of this problem, showing that it is a fundamental limitation of gradient descent for sequences with long-range dependencies. For language tasks — where dependencies can span dozens of tokens — this is a severe constraint on what simple RNNs can learn.

Gradient clipping (capping gradient norms during training) partially addresses explosion, but not vanishing. The solution to vanishing gradients required a new architecture: the Long Short-Term Memory.

Long Short-Term Memory Networks

Section Overview

What you will learn: How LSTMs solve the vanishing gradient problem using gating mechanisms; what each gate does; and how to interpret the LSTM cell state as explicit memory.

The LSTM Architecture

The Long Short-Term Memory (LSTM) network was introduced by Hochreiter and Schmidhuber (1997) to address the vanishing gradient problem directly. The key innovation is a cell state \(C_t\) — a separate memory pathway that runs through the entire sequence with minimal transformations, allowing gradients to flow back without vanishing. Access to the cell state is regulated by three gates: learned sigmoid functions that produce values between 0 and 1, controlling how much information is added, retained, or discarded.

The full LSTM update equations at time step \(t\) are:

\[f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \quad \text{(forget gate)}\] \[i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \quad \text{(input gate)}\] \[\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C) \quad \text{(candidate cell state)}\] \[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \quad \text{(cell state update)}\] \[o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \quad \text{(output gate)}\] \[h_t = o_t \odot \tanh(C_t) \quad \text{(hidden state)}\]

where \(\sigma\) is the sigmoid function, \(\odot\) is element-wise multiplication, and \([h_{t-1}, x_t]\) denotes concatenation of the previous hidden state and current input.

What Each Gate Does

Gate	Function	Analogy
Forget gate \(f_t\)	Decides what to erase from the cell state	Pressing the delete key
Input gate \(i_t\)	Decides what new information to write to the cell state	Taking notes
Output gate \(o_t\)	Decides what part of the cell state to expose as the hidden state	Reading from notes

The cell state \(C_t\) itself acts as a conveyor belt that carries information across long distances. Because it is updated additively (not multiplicatively across many layers), gradients flow back through it without exponential decay.

A comprehensive empirical evaluation of LSTM variants by Greff et al. (2017) confirmed that all three gates are important contributions: removing any one of them reliably degrades performance across a range of sequence tasks.

Gated Recurrent Units

The Gated Recurrent Unit (GRU), introduced by Cho et al. (2014), is a streamlined variant of LSTM that merges the forget and input gates into a single update gate and eliminates the separate cell state. GRUs have fewer parameters than LSTMs and often achieve comparable performance, particularly on smaller datasets.

\[z_t = \sigma(W_z [h_{t-1}, x_t] + b_z) \quad \text{(update gate)}\] \[r_t = \sigma(W_r [h_{t-1}, x_t] + b_r) \quad \text{(reset gate)}\] \[\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t] + b) \quad \text{(candidate state)}\] \[h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad \text{(hidden state)}\]

In practice, LSTMs and GRUs are largely interchangeable. A useful default is to start with an LSTM, then try a GRU if training speed is a concern or if the dataset is small.

Bidirectional RNNs

A standard RNN reads sequences left to right. A Bidirectional RNN runs two RNNs in parallel — one processing the sequence forward, the other backward — and concatenates their hidden states at each time step. This gives the model access to both past and future context at every position, which is valuable for tasks like part-of-speech tagging or named entity recognition where the label at position \(t\) depends on words that follow as well as words that precede.

Building an LSTM in R

Section Overview

What you will learn: How to prepare text data for an LSTM; how to define, compile, train, and evaluate a Keras LSTM model; and how to interpret training curves.

The Task: Sentiment Classification

We will train an LSTM to classify movie reviews as positive or negative using the IMDB dataset — 25,000 training reviews and 25,000 test reviews, each labelled with a binary sentiment. This is a standard benchmark for text classification and is bundled directly with Keras, making it convenient for learning purposes.

Preparing the Data

Code

# Maximum vocabulary size (most frequent tokens only)
max_features <- 10000
# Maximum sequence length (longer reviews are truncated)
maxlen <- 200

# Load IMDB data — labels are 0 (negative) or 1 (positive)
imdb      <- dataset_imdb(num_words = max_features)
x_train   <- imdb$train$x
y_train   <- imdb$train$y
x_test    <- imdb$test$x
y_test    <- imdb$test$y

cat(glue("Training samples: {length(x_train)}\n"))

Training samples: 25000

Code

cat(glue("Test samples:     {length(x_test)}\n"))

Test samples:     25000

Each review is stored as an integer sequence — each integer is a word index. We pad all sequences to the same length using pad_sequences(), which appends zeros at the beginning of shorter sequences (pre-padding):

Code

x_train <- pad_sequences(x_train, maxlen = maxlen)
x_test  <- pad_sequences(x_test,  maxlen = maxlen)

dim(x_train)  # Should be (25000, 200)

[1] 25000   200

What Is an Embedding?

Before the LSTM layer, we use an Embedding layer. This maps each integer (word index) to a dense, real-valued vector of fixed dimension (here, 128). These vectors are learned during training alongside the other weights: words that appear in similar contexts end up with similar vectors. Embeddings are a compact, continuous representation of vocabulary that carries far more information than one-hot encoding.

Defining the Model

Code

model <- keras_model_sequential(name = "imdb_lstm") |>

layer_embedding(
  input_dim    = max_features,
  output_dim   = 128,
  input_length = maxlen,      # add this line
  name         = "embedding"
) |>

  # LSTM layer: processes the sequence
  layer_lstm(
    units          = 64,
    dropout        = 0.2,   # dropout on inputs
    recurrent_dropout = 0.2, # dropout on recurrent connections
    name           = "lstm_1"
  ) |>

  # Dense output layer: binary classification
  layer_dense(units = 1, activation = "sigmoid", name = "output")

summary(model)

Model: "imdb_lstm"
┌───────────────────────────────────┬──────────────────────────┬───────────────
│ Layer (type)                      │ Output Shape             │       Param # 
├───────────────────────────────────┼──────────────────────────┼───────────────
│ embedding (Embedding)             │ ?                        │   0 (unbuilt) 
├───────────────────────────────────┼──────────────────────────┼───────────────
│ lstm_1 (LSTM)                     │ ?                        │   0 (unbuilt) 
├───────────────────────────────────┼──────────────────────────┼───────────────
│ output (Dense)                    │ ?                        │   0 (unbuilt) 
└───────────────────────────────────┴──────────────────────────┴───────────────
 Total params: 0 (0.00 B)
 Trainable params: 0 (0.00 B)
 Non-trainable params: 0 (0.00 B)

Dropout in LSTMs

Dropout (Srivastava et al. 2014) randomly sets a proportion of activations to zero during training, preventing the network from co-adapting to specific patterns and reducing overfitting. In LSTMs, dropout applies to the input connections and recurrent_dropout applies to the recurrent connections (from \(h_{t-1}\) to \(h_t\)). Using a small recurrent dropout (0.1–0.3) is often more beneficial than a large input dropout alone. As recommended by Chollet and Allaire (2018), the same dropout mask is applied at every time step — this is essential for correct regularisation of RNNs.

Compiling and Training

Code

model |> compile(
  optimizer = optimizer_adam(learning_rate = 1e-3),
  loss      = "binary_crossentropy",
  metrics   = c("accuracy")
)

Code

history <- model |> fit(
  x_train, y_train,
  epochs          = 10,
  batch_size      = 128,
  validation_split = 0.2,   # hold out 20% of training data for validation
  callbacks = list(
    callback_early_stopping(
      monitor              = "val_loss",
      patience             = 3,
      restore_best_weights = TRUE
    )
  )
)

Epoch 1/10
157/157 - 22s - 141ms/step - accuracy: 0.7505 - loss: 0.5014 - val_accuracy: 0.8354 - val_loss: 0.3784
Epoch 2/10
157/157 - 15s - 99ms/step - accuracy: 0.8673 - loss: 0.3327 - val_accuracy: 0.8522 - val_loss: 0.3487
Epoch 3/10
157/157 - 16s - 99ms/step - accuracy: 0.8842 - loss: 0.2863 - val_accuracy: 0.8066 - val_loss: 0.4330
Epoch 4/10
157/157 - 15s - 95ms/step - accuracy: 0.8863 - loss: 0.2824 - val_accuracy: 0.8548 - val_loss: 0.3570
Epoch 5/10
157/157 - 17s - 105ms/step - accuracy: 0.9172 - loss: 0.2162 - val_accuracy: 0.8490 - val_loss: 0.3717

Early Stopping

callback_early_stopping() monitors the validation loss after each epoch and stops training when it has not improved for patience consecutive epochs. Setting restore_best_weights = TRUE ensures that the final model uses the weights from the epoch with the lowest validation loss, not the last epoch (which may be worse due to overfitting). This is an important safeguard against over-training.

Evaluating the Model

Code

results   <- model |> evaluate(x_test, y_test, verbose = 0)
test_loss <- as.numeric(results[["loss"]])
test_acc  <- as.numeric(results[["accuracy"]])
cat(glue("Test loss:     {round(test_loss, 4)}\n"))

Test loss:     0.3497

Code

cat(glue("Test accuracy: {round(test_acc * 100, 2)}%\n"))

Test accuracy: 85.52%

Code

# Plot training and validation accuracy / loss over epochs
plot(history) +
  theme_minimal(base_size = 13) +
  labs(title = "LSTM Training History — IMDB Sentiment",
       x = "Epoch")

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: span too small.  fewer data values than degrees of freedom.

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: pseudoinverse used at 0.98

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: neighborhood radius 2.02

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: reciprocal condition number 0

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: There are other near singularities as well. 4.0804

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: span too small.  fewer data values than degrees of freedom.

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: pseudoinverse used at 0.98

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: neighborhood radius 2.02

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: reciprocal condition number 0

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: There are other near singularities as well. 4.0804

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: span too small.  fewer data values than degrees of freedom.

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: pseudoinverse used at 0.98

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: neighborhood radius 2.02

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: reciprocal condition number 0

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: There are other near singularities as well. 4.0804

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: span too small.  fewer data values than degrees of freedom.

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: pseudoinverse used at 0.98

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: neighborhood radius 2.02

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: reciprocal condition number 0

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: There are other near singularities as well. 4.0804

A well-trained model on this dataset typically achieves around 85–88% test accuracy. If validation loss starts rising while training loss continues to fall, that is a clear sign of overfitting.

A Stacked LSTM: Going Deeper

Section Overview

What you will learn: How to stack multiple LSTM layers; when stacking helps; and how to use return_sequences correctly.

A single LSTM layer learns a single level of sequential abstraction. Stacked LSTMs (multiple LSTM layers on top of each other) can learn hierarchical representations — lower layers may capture local patterns (words and phrases), while higher layers capture more global structure (clauses, discourse). For most NLP tasks, 2–3 LSTM layers offer a good balance of capacity and training cost.

Code

model_stacked <- keras_model_sequential(name = "stacked_lstm") |>

  layer_embedding(
    input_dim    = max_features,
    output_dim   = 128,
    input_length = maxlen,      # add this line
    name         = "embedding"
  ) |>

  # First LSTM — must return full sequence for the next LSTM to consume
  layer_lstm(
    units             = 64,
    dropout           = 0.2,
    recurrent_dropout = 0.2,
    return_sequences  = TRUE,   # <-- essential for stacking
    name              = "lstm_1"
  ) |>

  # Second LSTM — only needs the final hidden state
  layer_lstm(
    units             = 32,
    dropout           = 0.2,
    recurrent_dropout = 0.2,
    return_sequences  = FALSE,
    name              = "lstm_2"
  ) |>

  layer_dense(units = 1, activation = "sigmoid", name = "output")

model_stacked |> compile(
  optimizer = optimizer_adam(learning_rate = 1e-3),
  loss      = "binary_crossentropy",
  metrics   = c("accuracy")
)

summary(model_stacked)

Model: "stacked_lstm"
┌───────────────────────────────────┬──────────────────────────┬───────────────
│ Layer (type)                      │ Output Shape             │       Param # 
├───────────────────────────────────┼──────────────────────────┼───────────────
│ embedding (Embedding)             │ ?                        │   0 (unbuilt) 
├───────────────────────────────────┼──────────────────────────┼───────────────
│ lstm_1 (LSTM)                     │ ?                        │   0 (unbuilt) 
├───────────────────────────────────┼──────────────────────────┼───────────────
│ lstm_2 (LSTM)                     │ ?                        │   0 (unbuilt) 
├───────────────────────────────────┼──────────────────────────┼───────────────
│ output (Dense)                    │ ?                        │   0 (unbuilt) 
└───────────────────────────────────┴──────────────────────────┴───────────────
 Total params: 0 (0.00 B)
 Trainable params: 0 (0.00 B)
 Non-trainable params: 0 (0.00 B)

return_sequences = TRUE

By default, an LSTM layer returns only the final hidden state — a single vector summarising the entire sequence. This is appropriate when the next layer is a Dense layer. When stacking LSTMs, the lower layer must return the full sequence of hidden states (one per time step) so that the upper LSTM layer has something to process. Always set return_sequences = TRUE on all LSTM layers except the last one in a stack.

Building the Model Eagerly in keras3

In keras3, sequential models are lazily built by default — shapes and parameter counts are not resolved until the model sees its first batch of data. To force the model to build immediately (so that summary() shows real parameter counts), pass input_length = maxlen to layer_embedding(). This tells Keras the exact sequence length at definition time and triggers eager construction of all downstream layers.

Using GRU Instead of LSTM

The GRU is a drop-in replacement for the LSTM layer. Because it has fewer parameters, it trains faster and can be preferable when data is limited:

Code

model_gru <- keras_model_sequential(name = "gru_sentiment") |>
  layer_embedding(input_dim = max_features, output_dim = 128,
                input_length = maxlen) |>   # add input_length
  layer_gru(
    units             = 64,
    dropout           = 0.2,
    recurrent_dropout = 0.2
  ) |>
  layer_dense(units = 1, activation = "sigmoid")

model_gru |> compile(
  optimizer = optimizer_adam(),
  loss      = "binary_crossentropy",
  metrics   = c("accuracy")
)

Tips for Training RNNs

Section Overview

What you will learn: Practical guidance for getting RNNs to train reliably, including choices of optimiser, sequence length, embedding dimension, and regularisation.

Training recurrent networks reliably requires attention to several practical considerations.

Sequence Length

Very long sequences are expensive and often unnecessary. Most sentiment in a movie review is established within the first 200–300 tokens; padding to the full review length wastes computation. A good default is to use the 90th percentile of token counts in your dataset as maxlen.

Embedding Dimension

The embedding dimension controls the size of the dense word vector. Common choices are 64, 128, or 256. Larger embeddings capture more nuance but add parameters. For small datasets, pre-trained embeddings (GloVe, fastText) can be frozen or fine-tuned in the layer_embedding() call.

Optimiser and Learning Rate

Adam with a learning rate of 1e-3 is a reliable default. If the model diverges (loss increases sharply), lower it to 1e-4. Using a learning rate schedule (e.g., reduce on plateau) helps squeeze out the last few percentage points of accuracy.

Regularisation Checklist

Technique	How to apply in Keras
Dropout on inputs	`dropout = 0.2` in `layer_lstm()`
Recurrent dropout	`recurrent_dropout = 0.2` in `layer_lstm()`
L2 weight regularisation	`kernel_regularizer = regularizer_l2(0.001)` in any layer
Early stopping	`callback_early_stopping(patience = 3)` in `fit()`
Reduce learning rate	`callback_reduce_lr_on_plateau()` in `fit()`

Overfitting Is Common

RNNs — especially stacked LSTMs — have many parameters and overfit easily on small text datasets (fewer than ~5,000 labelled examples). Signs of overfitting are: training accuracy climbing while validation accuracy plateaus or falls, and training loss decreasing while validation loss increases. Remedies, in order of priority: more dropout, early stopping, reducing model size (fewer units), and adding L2 regularisation.

A Character-Level Language Model

Section Overview

What you will learn: How to apply an LSTM to character-level sequence generation — a classic demonstration of what RNNs can learn about language structure.

Sentiment classification is a many-to-one task (sequence → single label). RNNs also excel at many-to-many tasks, where the model produces one output for every input time step. A classic example is character-level language modelling: given a sequence of characters, predict the next character. This forces the model to learn spelling, punctuation, word boundaries, and some grammar implicitly, with no explicit supervision.

Preparing Character Data

Code

# Example: use a small text (substitute your own corpus)
text_url <- "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text_raw <- readLines(text_url) |> paste(collapse = "\n")

# Build character vocabulary
chars     <- text_raw |> strsplit("") |> unlist() |> unique() |> sort()
char2idx  <- setNames(seq_along(chars) - 1L, chars)   # 0-based for Keras
idx2char  <- setNames(chars, seq_along(chars) - 1L)
vocab_size <- length(chars)

cat(glue("Corpus length: {nchar(text_raw)} characters\n"))

Corpus length: 1115393 characters

Code

cat(glue("Vocabulary size: {vocab_size} unique characters\n"))

Vocabulary size: 65 unique characters

Code

seq_length <- 40   # Input window: 40 characters
step       <- 3    # Sampling stride

# Build (input, target) pairs
starts   <- seq(1, nchar(text_raw) - seq_length, by = step)
x_chars  <- lapply(starts, function(i) substr(text_raw, i, i + seq_length - 1))
y_chars  <- lapply(starts, function(i) substr(text_raw, i + seq_length, i + seq_length))

cat(glue("Number of sequences: {length(x_chars)}\n"))

Number of sequences: 371785

Code

# ── Encode sequences as integers ────────────────────────────────────────────
n_seq <- length(x_chars)

x_int <- matrix(0L, nrow = n_seq, ncol = seq_length)
y_int <- integer(n_seq)

for (i in seq_len(n_seq)) {
  seq_chars   <- strsplit(x_chars[[i]], "")[[1]]
  x_int[i, ] <- char2idx[seq_chars] + 1L
  y_int[[i]] <- char2idx[[y_chars[[i]]]] + 1L
}

y_array <- to_categorical(y_int - 1L, num_classes = vocab_size)

# ── Define model ─────────────────────────────────────────────────────────────
model_char <- keras_model_sequential(name = "char_lstm") |>
  layer_embedding(
    input_dim    = vocab_size + 1L,
    output_dim   = 64,
    input_length = seq_length,
    name         = "embedding"
  ) |>
  layer_lstm(units = 128, return_sequences = FALSE, name = "lstm") |>
  layer_dense(units = vocab_size, activation = "softmax", name = "output")

model_char |> compile(
  optimizer = optimizer_rmsprop(learning_rate = 0.01),
  loss      = "categorical_crossentropy"
)

summary(model_char)

Model: "char_lstm"
┌───────────────────────────────────┬──────────────────────────┬───────────────
│ Layer (type)                      │ Output Shape             │       Param # 
├───────────────────────────────────┼──────────────────────────┼───────────────
│ embedding (Embedding)             │ ?                        │   0 (unbuilt) 
├───────────────────────────────────┼──────────────────────────┼───────────────
│ lstm (LSTM)                       │ ?                        │   0 (unbuilt) 
├───────────────────────────────────┼──────────────────────────┼───────────────
│ output (Dense)                    │ ?                        │   0 (unbuilt) 
└───────────────────────────────────┴──────────────────────────┴───────────────
 Total params: 0 (0.00 B)
 Trainable params: 0 (0.00 B)
 Non-trainable params: 0 (0.00 B)

Code

# ── Train ─────────────────────────────────────────────────────────────────────
history_char <- model_char |> fit(
  x_int, y_array,
  batch_size       = 128,
  epochs           = 10,
  validation_split = 0.1
)

Epoch 1/10
2615/2615 - 117s - 45ms/step - loss: 1.8967 - val_loss: 1.8828
Epoch 2/10
2615/2615 - 117s - 45ms/step - loss: 1.6564 - val_loss: 1.8488
Epoch 3/10
2615/2615 - 119s - 46ms/step - loss: 1.6150 - val_loss: 1.8286
Epoch 4/10
2615/2615 - 112s - 43ms/step - loss: 1.6002 - val_loss: 1.8412
Epoch 5/10
2615/2615 - 109s - 42ms/step - loss: 1.5941 - val_loss: 1.8574
Epoch 6/10
2615/2615 - 111s - 42ms/step - loss: 1.5952 - val_loss: 1.8794
Epoch 7/10
2615/2615 - 117s - 45ms/step - loss: 1.5941 - val_loss: 1.8966
Epoch 8/10
2615/2615 - 111s - 42ms/step - loss: 1.5953 - val_loss: 1.8806
Epoch 9/10
2615/2615 - 109s - 42ms/step - loss: 1.5950 - val_loss: 1.8687
Epoch 10/10
2615/2615 - 109s - 42ms/step - loss: 1.5980 - val_loss: 1.8723

Code

# ── Generate text ─────────────────────────────────────────────────────────────
sample_next_char <- function(preds, temperature = 1.0) {
  preds <- log(preds + 1e-8) / temperature
  preds <- exp(preds) / sum(exp(preds))
  sample(seq_along(preds), size = 1, prob = preds) - 1L
}

generate_text <- function(model, seed_text, n_chars = 200, temperature = 0.5) {
  generated <- seed_text
  current   <- seed_text
  for (i in seq_len(n_chars)) {
    chars_in <- strsplit(current, "")[[1]]
    x        <- matrix(char2idx[chars_in] + 1L, nrow = 1L)
    preds    <- predict(model, x, verbose = 0)[1, ]
    next_idx <- sample_next_char(preds, temperature)
    next_ch  <- idx2char[[as.character(next_idx)]]
    generated <- paste0(generated, next_ch)
    current   <- paste0(substr(current, 2, nchar(current)), next_ch)
  }
  generated
}

seed <- substr(text_raw, 1, seq_length)
cat(generate_text(model_char, seed, n_chars = 300, temperature = 0.6))

First Citizen:
Before we proceed any furn we of duke myself.

A Yorkbour it; be a hand to see he canst the offero,
Lay I talk the death the heaven life my father,
As this stance! thou this will words of enter,
That thou from then for a pricess to this?

CLARENCE:
But back of the hear the son thy discouse myself.

First:
As they living wit

Temperature Sampling

The temperature parameter controls how deterministic the model’s choices are. A temperature of 1.0 uses the model’s raw probability distribution. Values below 1.0 sharpen the distribution (more predictable output), while values above 1.0 flatten it (more varied and risky output). A value around 0.5–0.7 typically produces the most readable generated text.

Citation & Session Info

Citation

@manual{martinschweinberger2026deep,
  author       = {Martin Schweinberger},
  title        = {Deep Learning with R: Recurrent Neural Networks and TensorFlow},
  year         = {2026},
  note         = {https://ladal.edu.au/tutorials/deeplearning_tutorial/deeplearning_tutorial.html},
  organization = {The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia},
  edition      = {2026.04.01},
  doi          = {10.5281/zenodo.19362594}
}

Code

sessionInfo()

R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] glue_1.8.0      lubridate_1.9.4 forcats_1.0.0   stringr_1.6.0  
 [5] dplyr_1.2.0     purrr_1.2.1     readr_2.1.5     tidyr_1.3.2    
 [9] tibble_3.3.1    ggplot2_4.0.2   tidyverse_2.0.0 keras3_1.5.1   

loaded via a namespace (and not attached):
 [1] generics_0.1.4      renv_1.1.7          stringi_1.8.7      
 [4] lattice_0.22-6      hms_1.1.4           digest_0.6.39      
 [7] magrittr_2.0.4      timechange_0.3.0    evaluate_1.0.5     
[10] grid_4.4.2          RColorBrewer_1.1-3  fastmap_1.2.0      
[13] jsonlite_2.0.0      Matrix_1.7-2        whisker_0.4.1      
[16] tfruns_1.5.4        BiocManager_1.30.27 tensorflow_2.20.0  
[19] scales_1.4.0        codetools_0.2-20    cli_3.6.5          
[22] rlang_1.1.7         base64enc_0.1-6     withr_3.0.2        
[25] yaml_2.3.10         tools_4.4.2         tzdb_0.5.0         
[28] zeallot_0.1.0       reticulate_1.45.0   vctrs_0.7.2        
[31] R6_2.6.1            png_0.1-8           lifecycle_1.0.5    
[34] htmlwidgets_1.6.4   pkgconfig_2.0.3     pillar_1.11.1      
[37] gtable_0.3.6        Rcpp_1.1.1          xfun_0.56          
[40] tidyselect_1.2.1    rstudioapi_0.17.1   knitr_1.51         
[43] farver_2.1.2        htmltools_0.5.9     rmarkdown_2.30     
[46] dotty_0.1.0         compiler_4.4.2      S7_0.2.1

AI Transparency Statement

This tutorial was re-developed with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to help revise the tutorial text, structure the instructional content, generate the R code examples, and write the checkdown quiz questions and feedback strings. All content was reviewed, edited, and approved by the author (Martin Schweinberger), who takes full responsibility for the accuracy and pedagogical appropriateness of the material. The use of AI assistance is disclosed here in the interest of transparency and in accordance with emerging best practices for AI-assisted academic content creation.

References

Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. 1994. “Learning Long-Term Dependencies with Gradient Descent Is Difficult.” IEEE Transactions on Neural Networks 5 (2): 157–66. https://doi.org/10.1109/72.279181.

Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. “Learning Phrase Representations Using RNN Encoder–Decoder for Statistical Machine Translation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–34. Doha, Qatar: Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1179.

Chollet, François, and J. J. Allaire. 2018. Deep Learning with R. Shelter Island, NY: Manning Publications.

Elman, Jeffrey L. 1990. “Finding Structure in Time.” Cognitive Science 14 (2): 179–211. https://doi.org/10.1207/s15516709cog1402_1.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.

Greff, Klaus, Rupesh Kumar Srivastava, Jan Koutník, Bas Robert Steunebrink, and Jürgen Schmidhuber. 2017. “LSTM: A Search Space Odyssey.” IEEE Transactions on Neural Networks and Learning Systems 28 (10): 2222–32. https://doi.org/10.1109/TNNLS.2016.2582924.

Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term Memory.” Neural Computation 9 (8): 1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.

Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323: 533–36. https://doi.org/10.1038/323533a0.

Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” Journal of Machine Learning Research 15 (1): 1929–58. https://jmlr.org/papers/v15/srivastava14a.html.