This tutorial introduces the concept of Reinforcement Learning (RL) (see Sutton and Barto 2018; Wu et al. 2018; Paulus, Xiong, and Socher 2017), and how it can be applied in the domain of Natural Language Processing (NLP) and linguistics.

This tutorial is aimed at beginners and intermediate users of R with the aim of showcasing how RI works. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful methods associated with RI.

The entire R Notebook for the tutorial can be downloaded here. If you want to render the R Notebook on your machine, i.e. knitting the document to html or a pdf, you need to make sure that you have R and RStudio installed and you also need to download the bibliography file and store it in the same folder where you store the Rmd file.

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information how to use R here. For this tutorials, we need to install certain packages from an R library so that the scripts in this tutorial are executed without errors. Before continuing, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the libraries so you do not need to worry if it takes some time).

For this tutorial we will be primarily requiring four packages: tidytext for text manipulations, tidyverse for general tasks, textrank for the implementation of the TextRank algorithm and rvest to scrape through an article to use as an example. For this analysis an article for Time has been selected.

# set options
options(stringsAsFactors = F)
# install libraries
# install klippy for copy-to-clipboard button in code chunks

Now that we have installed the packages, we activate them as shown below.

# set options
options(stringsAsFactors = F)          # no automatic data transformation
options("scipen" = 100, "digits" = 12) # suppress math annotation
# activate packages
# activate klippy for copy-to-clipboard button

Once you have installed R and RStudio and also initiated the session by executing the code shown above, you are good to go.

Reinforment Learning

Reinforcement Learning enables a machines and software agents to independently determine the optimal behavior depending on a specific concept to enhance the overall performance. The system requires a reward feedback to learn its behavior which is known as reinforcement signal. The schematic diagram of Reinforcement Learning is provided below: -

Any RL framework comprises of 3 major components:

  • Action determines all possible moves that the agent can make which is normally expressed as a mathematical function.
  • State it is an explicit and quick circumstance that the agent can find itself in posed by the environment or any future circumstance
  • Reward it is the feedback input from the environment which measure the achievement or failure of the agent’s activities.

The are three broad categories of RL:

  • Value Based which determines the optimal value function and it is the maximum value achievable under any policy.
  • Policy Based which identifies the optimal policy achieving maximum future reward
  • Model Based involves a model which predicts attributes or provides representation of the environment

Without going into the mathematical intricacies of RL we will focus on possible applications of deep RL to linguistic data this tutorial. In its current form, RL plays a pivotal role in various Natural Language Processing (NLP) applications some of which are:

  • Article Summarisation
  • Question Answering (QA)
  • Dialogue Generation
  • Dialogue System
  • Knowledge-based QA
  • Machine Translation
  • Text Generation

In the following sections we will explore some use cases of RL and interpret how deep RL can implement them.

Text Summarisation

A deep reinforced model for text summarisation involves sequence of input tokens x={x1,x2,…,xn} and produces a sequence of output (summary) tokens. A schematic presentation of the process is shown below:

For the article summarisation objective the deep RL has the following components:

  • Action which involves a function ut which copies and generates summary output yt
  • State it encapsulates the hidden states of encoder and previous outputs
  • Reward which generates a rough score determining the performance of the summarisation

Text summarisation (see Mihalcea and Tarau 2004) is highly critical in extracting important information from large texts.

In case of text summarisation there are broadly two categories:

  • Extractive Summarisation
  • Abstractive Summarisation

In case of Extractive Summarisation words and sentences are scored according to a specific metric and then utilizing that information for summarizing based copying or pasting the most informative parts of the text.

On the other hand Abstractive Summarisation involves building a semantic representation of the text and then incorporating natural language generation techniques to generate text highlighting the informative parts of the parent text document.

Here, we will be focusing on an extractive summarisation method called TextRank which is hinged upon the PageRank algorithm which was developed by Google to rank websites based on their importance.

The TextRank Algorithm

TextRank is a graph-based ranking algorithm for NLP. Graph-based ranking algorithms evaluate the importance of a vertex within a graph, based on global information extracted recursively from the entire graph. When one vertex is associated with another it is actually casting a vote for that vertex. The higher the number of votes cast for a vertex, the higher importance of that vertex.

In the NLP case it is necessary to define vertices and edges. In this tutorial we will be using sentences as vertices and words as edges. Thus sentences with words present in many other sentences will have higher priority

# define url
url <- ""
# read in data
article <- read_html(url) %>%
  html_nodes('div[class="padded"]') %>%

Next the article is loaded into a tibble. Then tokenisation is implemented according to sentences. Although this tokenisation is fully perfect it has a lower number of dependencies and is suitable for this case. Finally we add column for sentence number and switch the order of the columns.

article_sentences <- tibble(text = article) %>%
  tidytext::unnest_tokens(sentence, text, token = "sentences") %>%
  dplyr::mutate(sentence_id = row_number()) %>%
  dplyr::select(sentence_id, sentence)
## # A tibble: 19 × 2
##    sentence_id sentence                                                         
##          <int> <chr>                                                            
##  1           1 "fitbit is launching a new fitness tracker designed for children…
##  2           2 "the [tempo-ecommerce src=”…
##  3           3 "the most important of which is fitbit’s new family account opti…
##  4           4 "parents must approve who their child can connect with via the f…
##  5           5 "but while fitbit’s default move goal is 30 minutes for adult us…
##  6           6 "fitbit says the tracker is designed for children eight years ol…
##  7           7 "fitbit will also be introducing a family faceoff feature that l…
##  8           8 "the app also will reward children with in-app badges for achiev…
##  9           9 "fitbit’s new child-friendly fitness band will be available in b…
## 10          10 "the ace launch is part of fitbit’s broader goal of branching ou…
## 11          11 "the company also announced a new smartwatch on tuesday called t…
## 12          12 "above all else, the ace is an effort to get children up and mov…
## 13          13 "the centers for disease control and prevention report that the …
## 14          14 "but parents who want to encourage their children to move alread…
## 15          15 "garmin’s $79.99 vivofit jr. 2, for example, comes in themed ski…
## 16          16 "the $39.99 nabi compete, meanwhile, is sold in pairs so that fa…
## 17          17 "more must-read stories from time                               …
## 18          18 "here's what to expect    the u.s."                              
## 19          19 "might be the surprising determining factor in the future of arm…

Next we will tokenize based on words.

article_words <- article_sentences %>%
  tidytext::unnest_tokens(word, sentence)
## # A tibble: 603 × 2
##    sentence_id word     
##          <int> <chr>    
##  1           1 fitbit   
##  2           1 is       
##  3           1 launching
##  4           1 a        
##  5           1 new      
##  6           1 fitness  
##  7           1 tracker  
##  8           1 designed 
##  9           1 for      
## 10           1 children 
## # … with 593 more rows

We have one last step left is to remove the stop words in article_words as they are prone to result in redundancy.

article_words <- article_words %>%
  dplyr::anti_join(stop_words, by = "word")
## # A tibble: 299 × 2
##    sentence_id word     
##          <int> <chr>    
##  1           1 fitbit   
##  2           1 launching
##  3           1 fitness  
##  4           1 tracker  
##  5           1 designed 
##  6           1 children 
##  7           1 called   
##  8           1 fitbit   
##  9           1 ace      
## 10           1 sale     
## # … with 289 more rows

Using the textrank package it is really easy to implement the TextRank algorithm. The textrank_sentences function requires only 2 inputs:

  • A data frame with sentences
  • A data frame with tokens which are part of each sentence
article_summary <- textrank_sentences(data = article_sentences, 
                                      terminology = article_words)
# inspect the summary
## Textrank on sentences, showing top 5 most important sentences found:
##   1. fitbit is launching a new fitness tracker designed for children called the fitbit ace, which will go on sale for $99.95 in the second quarter of this year.
##   2. fitbit says the tracker is designed for children eight years old and up.
##   3. above all else, the ace is an effort to get children up and moving.
##   4. but parents who want to encourage their children to move already have several less expensive options to choose from.
##   5. the most important of which is fitbit’s new family account option, which gives parents control over how their child uses their tracker and is compliant with the children’s online privacy protection act, or coppa.

Lets have a look where these important sentences appear in the article:

article_summary[["sentences"]] %>%
  ggplot(aes(textrank_id, textrank, fill = textrank_id)) +
  geom_col() +
  theme_minimal() +
  scale_fill_viridis_c() +
  guides(fill = "none") +
  labs(x = "Sentence",
       y = "TextRank score",
       title = "Most informative sentences appear within first half of sentences",
       subtitle = 'In article "Fitbits Newest Fitness Tracker Is Just for Kids"',
       caption = "Source:")
Position of Important Sentences in the Article

Position of Important Sentences in the Article

Other Applications of RL

Dialogue Generation

In today’s digital world dialogue generation is a widely used application especially in chatbots. One widely used model in this regard is the Long Short Term Memory (LSTM) sequence-to-sequence (SEQ2SEQ) model. It is a neural generative model that maximizes the probability of generating a response given the previous dialogue. However SEQ2SEQ model has some constraints:

  • They tend to generate highly generic responses
  • Often they are stuck in an infinite loop of repetitive responses

This is where deep RL is much more efficient as it can integrate developer-defined rewards which efficiently mimics the true goal of chatbot development. In case of dialogue generation the component:

  • Action which involves a function that generates sequences of arbitrary lengths
  • State it comprises of previous 2 dialogue turns [pi,qi]
  • Reward which determines the ease of answering, information flow and semantic coherence

The schematic diagram highlighting the dialogue simulation between 2 agents using deep RL is shown below:

Neural Machine Translation

Most of Neural Machine Translation (NMT) models are based encoder-decoder framework with attention mechanism. The encoder initially maps a source sentence x={x1,x2,…,xn} to a set of continuous representations z={z1,z2,…,zn} . Given z the decoder then generates a target sentence y={y1,y2,…,ym} of word tokens one by one. RL is used to bridge the gap between training and inference of of NMT by directly optimizing the loss function at training time. In this scenario the NMT model acts as the agent which interacts with the environment which in this case are the previous words and the context vector z available at each step t. This is a a policy based RL and in place of a state a policy will be assigned in every iteration. The critical components of the RL for NMT are discussed below:

  • Policy which is a conditional probability defined by the parameters of the agent
  • Action is decided by the agent based on the policy and it will pick up a candidate word from the vocabulary
  • Reward is evaluated once the agent generates a complete sequence which in case of machine translation is Bilingual Evaluation Understudy (BLEU).BLEU is defined by comparing the generated sequence with the ground truth sequence.

The schematic of the overall process is depicted below:

Citation & Session Info

Majumdar, Dattatreya. 2022. Reinforcement Learning in NLP. Brisbane: The University of Queensland. url: (Version 2022.10.10).

  author = {Majumdar, Dattatreya},
  title = {Reinforcement Learning in NLP},
  note = {},
  year = {2022},
  organization = "The University of Queensland, Australia. School of Languages and Cultures},
  address = {Brisbane},
  edition = {2022.10.10}
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
##  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## other attached packages:
##  [1] rvest_1.0.2     textrank_0.3.1  forcats_0.5.1   stringr_1.4.0  
##  [5] dplyr_1.0.9     purrr_0.3.4     readr_2.1.2     tidyr_1.2.0    
##  [9] tibble_3.1.7    ggplot2_3.3.6   tidyverse_1.3.2 tidytext_0.3.3 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.3          sass_0.4.1          jsonlite_1.8.0     
##  [4] viridisLite_0.4.0   modelr_0.1.8        bslib_0.3.1        
##  [7] assertthat_0.2.1    highr_0.9           selectr_0.4-2      
## [10] renv_0.15.4         googlesheets4_1.0.0 cellranger_1.1.0   
## [13] yaml_2.3.5          pillar_1.7.0        backports_1.4.1    
## [16] lattice_0.20-45     glue_1.6.2          digest_0.6.29      
## [19] colorspace_2.0-3    htmltools_0.5.2     Matrix_1.4-1       
## [22] pkgconfig_2.0.3     broom_1.0.0         haven_2.5.0        
## [25] scales_1.2.0        tzdb_0.3.0          googledrive_2.0.0  
## [28] farver_2.1.1        generics_0.1.3      ellipsis_0.3.2     
## [31] withr_2.5.0         klippy_0.0.0.9500   cli_3.3.0          
## [34] magrittr_2.0.3      crayon_1.5.1        readxl_1.4.0       
## [37] evaluate_0.15       tokenizers_0.2.1    janeaustenr_0.1.5  
## [40] fs_1.5.2            fansi_1.0.3         SnowballC_0.7.0    
## [43] xml2_1.3.3          tools_4.2.1         data.table_1.14.2  
## [46] hms_1.1.1           gargle_1.2.0        lifecycle_1.0.1    
## [49] munsell_0.5.0       reprex_2.0.1        compiler_4.2.1     
## [52] jquerylib_0.1.4     rlang_1.0.4         grid_4.2.1         
## [55] igraph_1.3.2        labeling_0.4.2      rmarkdown_2.14     
## [58] gtable_0.3.0        DBI_1.1.3           curl_4.3.2         
## [61] R6_2.5.1            lubridate_1.8.0     knitr_1.39         
## [64] fastmap_1.1.0       utf8_1.2.2          stringi_1.7.8      
## [67] Rcpp_1.0.8.3        vctrs_0.4.1         dbplyr_2.2.1       
## [70] tidyselect_1.1.2    xfun_0.31

Back to top

Back to HOME


Mihalcea, Rada, and Paul Tarau. 2004. “Textrank: Bringing Order into Text.” In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 404–11.

Paulus, Romain, Caiming Xiong, and Richard Socher. 2017. “A Deep Reinforced Model for Abstractive Summarization.” arXiv Preprint arXiv:1705.04304.

Sutton, Richard S, and Andrew G Barto. 2018. Reinforcement Learning: An Introduction. MIT press.

Wu, Lijun, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2018. “A Study of Reinforcement Learning for Neural Machine Translation.” arXiv Preprint arXiv:1808.08866.