44 Text as Data and NLP for Causal Inference

Much of the information that social scientists, economists, and business analysts care about is locked inside text. Central bank statements, earnings calls, product reviews, court opinions, legislative speeches, news articles, customer support tickets, and physician notes all encode quantities that we would like to measure and reason about. Until recently these documents were either read by hand, at enormous cost and limited scale, or ignored. The combination of cheap digitized corpora and statistical methods for representing language has turned text into a measurement instrument, and that instrument is increasingly used inside causal arguments rather than merely descriptive ones.

This chapter treats text as data in the sense of Gentzkow et al. (2019), whose review in the Journal of Economic Literature frames the core problem clearly. A corpus of documents is extraordinarily high dimensional: the number of possible word sequences vastly exceeds the number of documents, so any useful analysis must impose structure that reduces dimensionality while retaining the signal relevant to the question at hand. The methods differ, but they share a common shape. We map raw documents into a numerical representation, use that representation to recover a lower-dimensional quantity of interest such as sentiment, topic, ideology, or a predicted label, and then feed that quantity into a downstream statistical model. When the downstream model is causal, the fact that our key variable was estimated from text rather than observed directly creates problems that do not arise when we use a clean administrative measure. Most of this chapter is about those problems and how to manage them.

44.1 Why Text Matters for Measurement

The first reason to take text seriously is coverage. Many constructs of genuine economic and political importance have no off-the-shelf numeric counterpart. The tone of a Federal Reserve statement, the partisanship of a speech, the novelty of a patent, the affect in a customer email, and the presence of a particular policy commitment in a contract are all things we can articulate and recognize but cannot download as a column. Text lets us construct these measures at scale and at low marginal cost, which is precisely the comparative advantage Gentzkow et al. (2019) emphasize.

The second reason is that text often sits causally between things we already observe. A manager reads an analyst report and then makes a decision; a voter reads a campaign message and then turns out or abstains; a consumer reads reviews and then purchases. In each case the text is not a nuisance to be summarized away but a variable with a role in the causal story, sometimes the treatment, sometimes a mediator, sometimes a confounder. Recognizing that role is the difference between a clean estimand and a quantity that no one can interpret.

The third reason is that text frequently records information about confounders that would otherwise go unobserved. Two firms that issue similar press releases, two patients whose clinical notes describe similar symptoms, or two litigants whose filings raise similar claims may be comparable on dimensions that structured covariates miss entirely. Using text to adjust for such confounding is appealing, but as we will see it rests on assumptions that are easy to state and hard to defend.

44.2 Representing Text

Every method in this area begins by turning documents into numbers. The simplest and still surprisingly durable representation is the bag of words. A document becomes a vector of counts over a fixed vocabulary, discarding word order entirely. The document term matrix that results is sparse and high dimensional, and it is the starting point for most of what follows. Term frequency-inverse document frequency, usually written tf-idf, rescales these counts so that words common across the whole corpus are downweighted and words that distinguish a document are emphasized. Gentzkow et al. (2019) discuss why this crude representation works as well as it does: for many prediction and classification tasks the marginal distribution of words carries most of the usable signal.

A small bag-of-words pipeline is easy to build with the tidytext approach, in which a corpus is reshaped into a tidy table with one row per document per token and then manipulated with ordinary data tools. The toy example below tokenizes a handful of sentences and counts terms, which is enough to see the shape of a document term matrix without depending on any model download.

library(tidytext)
library(dplyr)

docs <- tibble(
    doc_id = c("a", "b", "c"),
    text = c(
        "the central bank raised rates to fight inflation",
        "the central bank cut rates to support growth",
        "consumers reported higher inflation expectations"
    )
)

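# One row per document per token (unnest_tokens lowercases by default),
# then drop stop words and count the remaining terms within each document.
# Note: get_stopwords() relies on the stopwords package being installed.
term_counts <- docs |>
    tidytext::unnest_tokens(word, text) |>
    dplyr::anti_join(tidytext::get_stopwords(), by = "word") |>
    dplyr::count(doc_id, word, sort = TRUE)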

term_counts
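
The tf-idf reweighting described earlier can be sketched in a few lines. The block below is a minimal illustration in base R, not the tidytext implementation. It assumes one common definition (term frequency as a share of document length, inverse document frequency as the log of the number of documents over the number containing the term), uses whitespace tokenization, and keeps stop words so that the downweighting of a common word such as "the" is visible. The helper name `count_terms` is ours.

```r
# Toy corpus: the same three sentences, repeated so the block runs on its own.
docs <- c(
    a = "the central bank raised rates to fight inflation",
    b = "the central bank cut rates to support growth",
    c = "consumers reported higher inflation expectations"
)

# Whitespace tokenization (an assumption; real pipelines need more care).
tokens <- strsplit(tolower(docs), "\\s+")
vocab <- sort(unique(unlist(tokens)))

# Document term matrix of raw counts: one row per document, one column per term.
count_terms <- function(tok) as.numeric(table(factor(tok, levels = vocab)))
dtm <- t(vapply(tokens, count_terms, numeric(length(vocab))))
colnames(dtm) <- vocab

# Term frequency: counts divided by document length.
tf <- dtm / rowSums(dtm)

# Inverse document frequency: log(N / number of documents containing the term).
idf <- log(nrow(dtm) / colSums(dtm > 0))

# tf-idf: scale each column of tf by that term's idf weight.
tf_idf <- sweep(tf, 2, idf, `*`)

round(tf_idf[, c("the", "inflation", "growth")], 3)
```

In document b, "growth" appears in no other document and so receives a larger weight than "the", which appears in two of the three documents.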

Bag of words throws away an enormous amount of structure, and the next family of representations tries to recover some of it by positing latent themes. Latent Dirichlet allocation, introduced by Blei et al. (2003), models each document as a mixture over a small number of topics and each topic as a distribution over words. The output is two sets of distributions, one giving the topic shares of each document and one giving the word loadings of each topic, and these low dimensional topic shares are often what a researcher carries into a downstream analysis. Topic models are unsupervised, so the topics that emerge are whatever best explains co-occurrence patterns, and they need not correspond to the construct a researcher has in mind.
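
It may help to write the generative story down. The notation below is our own assumption rather than the source's: $K$ topics, $D$ documents, $N_d$ words in document $d$, topic shares $\theta_d$, topic word distributions $\beta_k$, and Dirichlet hyperparameters $\alpha$ and $\eta$. The Dirichlet prior on $\beta_k$ corresponds to the smoothed variant of the model; the basic version treats $\beta$ as a fixed parameter.

```latex
\begin{aligned}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{dn} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1, \dots, N_d \\
w_{dn} \mid z_{dn}, \beta &\sim \mathrm{Categorical}(\beta_{z_{dn}})
\end{aligned}
```

Here $z_{dn}$ is the latent topic assignment of the $n$-th word in document $d$ and $w_{dn}$ is the observed word. The quantities carried downstream are typically the estimated $\theta_d$.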

The structural topic model of Roberts et al. (2014) extends this idea in a direction that matters for social science. It lets document-level covariates such as the author’s party, the publication date, or a treatment indicator shift both the prevalence of topics and the words used within them. This makes the topic model itself a vehicle for asking how text varies with observed characteristics, and Roberts et al. (2014) document the estimation and interpretation workflow in detail. The same caution applies: the recovered topics are model artifacts whose meaning the analyst must validate rather than assume.

A different representation abandons counts in favor of dense vectors that place words in a continuous space where geometric proximity encodes semantic similarity. Word embeddings learned from co-occurrence statistics, as in the GloVe approach of Pennington et al. (2014), map each word to a vector such that words used in similar contexts lie near one another. Document vectors can then be built by aggregating word vectors. Embeddings capture similarity that bag of words misses, the words inflation and prices being close even when they never co-occur in a short document, at the cost of interpretability and of dependence on the corpus the embeddings were trained on.
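
To make "geometric proximity" concrete, the sketch below computes cosine similarity between word vectors and builds a document vector by averaging. The four-dimensional vectors are hypothetical, invented purely for illustration; they are not GloVe output, and the helpers `cosine` and `doc_vector` are our own names.

```r
# Hypothetical word vectors, invented for illustration (not trained embeddings).
emb <- rbind(
    inflation = c(0.9, 0.8, 0.1, 0.0),
    prices    = c(0.8, 0.9, 0.2, 0.1),
    court     = c(0.0, 0.1, 0.9, 0.8)
)

# Cosine similarity: the inner product of two vectors after length normalization.
cosine <- function(u, v) sum(u * v) / sqrt(sum(u^2) * sum(v^2))

sim_close <- cosine(emb["inflation", ], emb["prices", ])
sim_far   <- cosine(emb["inflation", ], emb["court", ])
c(inflation_prices = sim_close, inflation_court = sim_far)

# A simple document vector: the average of the vectors of its words.
doc_vector <- function(words, emb) colMeans(emb[words, , drop = FALSE])
doc_vector(c("inflation", "prices"), emb)
```

Averaging is only one way to aggregate; it discards word order just as bag of words does, which is part of why contextual models were developed.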

The current frontier replaces static word vectors with contextual representations produced by transformer models, in which the vector assigned to a word depends on the surrounding sentence. Vaswani et al. (2017) introduced the transformer architecture, which is built on attention mechanisms, and subsequent work in computational linguistics has shown that contextual embeddings substantially improve performance on many language tasks. For causal work the relevant point is pragmatic. Transformer embeddings give a richer numeric summary of a document, but they are even less interpretable than topic shares, they are expensive to compute, and they require downloading large pretrained models, so the code that produces them does not run in a clean, lightweight session. None of these representations changes the fundamental inferential problem, which is that the variable we ultimately use is an estimate.

44.3 Three Roles of Text in Causal Inference

It is useful to organize the field by the role text plays in the causal diagram rather than by the algorithm used to process it. Feder et al. (2022) survey this terrain across the computational linguistics and statistics literatures, and they emphasize that the same corpus can sit in very different positions depending on the question.

44.3.1 Text as Treatment

In the first role the text is the cause. A randomized message, an email tone, a framing of a policy, or the readability of a disclosure is varied, and we want its effect on a behavioral outcome. The conceptual difficulty here is that a document is not a scalar. When we say we estimated the effect of a more negative tone, we have implicitly defined a treatment by projecting the document onto one dimension, and many other features of the text moved along with tone. The estimand is the effect of a latent treatment that we have inferred, and unless the text generating process was controlled, that latent feature may be entangled with others.

The cleanest version of this problem arises when researchers manipulate text and then must define the treatment that was actually delivered. Fong and Grimmer (2016) formalize a setting in which the treatment is an unknown function of the text and develop a procedure that discovers treatments and estimates their effects while guarding against the temptation to define the treatment using the same data that estimate its effect. Their central methodological point generalizes well beyond their specific model. If you let the outcome data influence which textual feature you call the treatment, you will find effects whether or not any exist, so the discovery of the treatment and the estimation of its effect must be separated.
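
The point about separating discovery from estimation can be seen in a small simulation. This is a sketch under our own assumptions, not the Fong and Grimmer procedure: outcomes are pure noise, the candidate "treatments" are 200 independent binary text features, and we pick the feature with the largest difference in means. All names (`discover`, `estimate`, `diff_means`) are hypothetical.

```r
set.seed(1)
n <- 400   # documents
p <- 200   # candidate binary text features, e.g. word indicators

# Simulated data with NO true effect: features and outcome are independent.
X <- matrix(rbinom(n * p, 1, 0.5), nrow = n)
y <- rnorm(n)

discover <- sample(n, n / 2)                # half used to pick the "treatment"
estimate <- setdiff(seq_len(n), discover)   # half used only to estimate its effect

# Difference in mean outcome between documents with and without feature j.
diff_means <- function(j, rows) {
    mean(y[rows][X[rows, j] == 1]) - mean(y[rows][X[rows, j] == 0])
}

# Discovery: choose the feature with the largest absolute difference.
disc_effects <- vapply(seq_len(p), diff_means, numeric(1), rows = discover)
best <- which.max(abs(disc_effects))

same_data_effect <- disc_effects[best]           # reuses the discovery half
held_out_effect  <- diff_means(best, estimate)   # uses fresh documents

c(same_data = same_data_effect, held_out = held_out_effect)
```

The estimate computed on the discovery half is the maximum over many noisy comparisons, so it is large by construction even though no effect exists. The held-out estimate is an ordinary difference in means for a feature chosen without reference to those documents.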

44.3.2 Text as Outcome

In the second role the text is the consequence. A policy, an intervention, or a treatment is administered, and we measure its effect on what people write or say: the sentiment of reviews after a product change, the topics legislators emphasize after an electoral shock, or the language of disclosures after a regulation. Here the text is summarized into an outcome measure, a sentiment score or a topic share, and that measure is regressed on treatment.

The danger specific to this role is that the measurement model and the treatment can be confounded through the analyst. If the same labeled examples or the same topic model are used to define the outcome and are themselves influenced by treated documents, the recovered outcome can absorb part of the treatment effect into its own construction. The structural topic model framework of Roberts et al. (2014) is often used in exactly this setting, with treatment as a prevalence covariate, and the resulting topic prevalence effects must be read as effects on a model defined quantity rather than on a pre-specified objective measure.

44.3.3 Text as Confounder or Control

In the third role text is neither cause nor effect but a record of confounding. Two units that received different treatments may differ on characteristics that are written down in their documents but absent from structured data, and adjusting for those textual characteristics promises to close a backdoor path that we could not otherwise close. This is the most seductive and the most dangerous use of text. Feder et al. (2022) and the methods they review make clear that using text to adjust for confounding requires that the text actually contain the confounder and that our representation recover it well enough to deconfound, two conditions that are jointly difficult to verify.

The core tension is one of dimensionality and overlap. The text representation that is rich enough to capture a subtle confounder is typically so high dimensional that no two units share it, which destroys the overlap that adjustment requires. The representation that is coarse enough to permit overlap, a handful of topics, may not contain the confounder at all. Methods in this area, including those built on supervised dimension reduction and on double machine learning ideas in the spirit of Chernozhukov et al. (2018), try to learn a representation that retains exactly the part of the text relevant to both treatment and outcome, but the validity of the resulting adjustment depends on assumptions about what the text captures that data alone cannot test.
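
A toy simulation makes the overlap problem tangible. The setup is our own assumption: independent binary features stand in for a text representation, treatment is assigned at random, and "overlap" is measured crudely as the share of treated units that have at least one control unit with an identical feature profile.

```r
set.seed(2)
n <- 500

# Share of treated units with an exact control match on all p binary features.
exact_match_share <- function(p) {
    X <- matrix(rbinom(n * p, 1, 0.5), nrow = n)
    treat <- rbinom(n, 1, 0.5)
    profile <- apply(X, 1, paste, collapse = "")   # one string per unit
    mean(profile[treat == 1] %in% profile[treat == 0])
}

dims <- c(2, 5, 10, 20, 40)
shares <- vapply(dims, exact_match_share, numeric(1))
data.frame(features = dims, share_with_exact_match = round(shares, 2))
```

With a handful of features nearly every treated unit has an exact counterpart; with dozens, essentially none do. Real adjustment methods rely on smoothing rather than exact matching, but the same arithmetic is what makes rich representations hard to balance on.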

44.4 Estimands and Identification When Variables Come From Text

The unifying lesson across all three roles is that a variable estimated from text is not the same as a variable observed directly, and treating it as if it were observed is the source of most errors in this literature. Three issues deserve explicit attention.

The first is measurement error. When sentiment, topic, or a class label is predicted from text, the prediction differs from the truth, and that error propagates into the downstream causal estimate. If the text-derived variable is the sole regressor, classical measurement error attenuates its coefficient toward zero, by the ratio of the true variable's variance to the variance of its noisy measure, but the error in text-based measures is rarely classical. Predictions are correlated with the very features that drive the outcome, so the bias can go in either direction and need not shrink as the corpus grows. Gentzkow et al. (2019) are explicit that the high dimensionality of text makes naive plug-in estimation prone to overfitting, and that the predicted quantity inherits the idiosyncrasies of the model that produced it.
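
The contrast between classical and non-classical error is easy to simulate. The sketch below is illustrative only; the data-generating process and the error structures are our assumptions, chosen so that the population slopes are known.

```r
set.seed(3)
n <- 20000
beta <- 1

x_true <- rnorm(n)          # the construct we wish we observed
e <- rnorm(n)               # outcome noise
y <- beta * x_true + e

# Classical error: independent noise with the same variance as x_true.
# Population slope: beta * 1 / (1 + 1) = 0.5.
x_classical <- x_true + rnorm(n)

# Non-classical error: the measurement picks up part of the outcome noise,
# as when a classifier keys on outcome-related words.
# Population slope: (1 + 0.5) / (1 + 0.25) = 1.2.
x_nonclassical <- x_true + 0.5 * e

b_true         <- coef(lm(y ~ x_true))[["x_true"]]
b_classical    <- coef(lm(y ~ x_classical))[["x_classical"]]
b_nonclassical <- coef(lm(y ~ x_nonclassical))[["x_nonclassical"]]

round(c(true = b_true, classical = b_classical, nonclassical = b_nonclassical), 2)
```

The classical case shrinks the slope toward zero, while the non-classical case pushes it above the true value of one, and neither bias goes away with more data.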

The second is researcher degrees of freedom. A text pipeline involves dozens of choices: the vocabulary, the stemming, the number of topics, the labeled training set, and the threshold that turns a probability into a class. Each choice can be tuned, knowingly or not, toward a desired result. Because these choices are made on the same documents that enter the analysis, the space of defensible specifications is large enough that almost any conclusion can be reached. The discipline that Fong and Grimmer (2016) impose, separating the construction of the text variable from the estimation of its effect, is the main defense against this, and it is closely related to the broader machine learning for causal inference practice surveyed by Athey and Imbens (2019) of using one part of the data to learn nuisance functions and another to estimate effects.

The third is the danger of in-sample fitting. A model that is trained and then applied on the same documents will produce a text variable that is mechanically related to the outcome through the shared sample. The remedy is the standard one from prediction: hold out a portion of the data. Labels and topic models should be learned on a training split, the resulting measurement model should be applied to a separate split, and causal estimation should use only the held-out measurements. This sample splitting is the textual analogue of the cross-fitting that makes double machine learning estimators well behaved in Chernozhukov et al. (2018), and it converts an uncontrolled overfitting problem into one with valid out-of-sample behavior.
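
One way to obtain held-out measurements for every document is K-fold cross-fitting: each document's text measure is predicted by a model trained only on the other folds. The sketch below uses simulated features and a linear model as a stand-in for whatever measurement model is used in practice; the variable names and the choice of five folds are our assumptions.

```r
set.seed(4)
n <- 300
p <- 40

# Simulated document features and a noisy label driven by the first feature.
X <- matrix(rnorm(n * p), nrow = n)
label <- X[, 1] + rnorm(n)
dat <- data.frame(label = label, X)   # feature columns are named X1, ..., X40

K <- 5
fold <- sample(rep(seq_len(K), length.out = n))
measure_oof <- numeric(n)             # out-of-fold text measure

for (k in seq_len(K)) {
    held <- fold == k
    fit <- lm(label ~ ., data = dat[!held, ])                # train on other folds
    measure_oof[held] <- predict(fit, newdata = dat[held, ]) # apply to held-out fold
}

# For comparison: a measure fitted and applied on the same documents.
measure_in <- fitted(lm(label ~ ., data = dat))

c(in_sample = cor(measure_in, label), out_of_fold = cor(measure_oof, label))
```

The in-sample measure tracks the label more closely than the out-of-fold measure because it has absorbed some of the label's noise. That extra agreement is exactly the mechanical dependence that splitting is meant to remove.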

Closely tied to sample splitting is the case for held-out human measurement. Because every automated text measure is an estimate, its quality has to be established against a gold standard that humans produce on documents the model never saw. Hand coding a validation sample, comparing the model’s labels to the human labels, and reporting that agreement is not optional polish; it is the only evidence that the construct being measured is the construct claimed. Some recent methods go further and use a small validated sample to correct the bias that an imperfect classifier introduces into the downstream estimate, which makes explicit that the validation data are doing inferential work, not just reassuring the reader.
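
Reporting agreement can be as simple as a confusion table plus a chance-corrected statistic. The sketch below computes raw agreement and Cohen's kappa in base R. The ten labels are hypothetical, invented for illustration, and a real validation sample would be far larger.

```r
# Hypothetical labels on a held-out validation sample (invented for illustration).
human <- c("pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos", "neg", "pos")
model <- c("pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "pos")

lev <- c("neg", "pos")
confusion <- table(
    human = factor(human, levels = lev),
    model = factor(model, levels = lev)
)
confusion

# Observed agreement: share of documents on the diagonal.
p_obs <- sum(diag(confusion)) / sum(confusion)

# Agreement expected by chance, from the two sets of marginal label shares.
p_exp <- sum(rowSums(confusion) * colSums(confusion)) / sum(confusion)^2

# Cohen's kappa: agreement beyond chance, scaled by the maximum possible.
cohen_kappa <- (p_obs - p_exp) / (1 - p_exp)

c(agreement = p_obs, kappa = cohen_kappa)
```

In this toy sample the coders agree on 8 of 10 documents, chance agreement is 0.5, and kappa is 0.6. The off-diagonal cells of the confusion table are also the raw material that bias-correction methods work from.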

44.5 Practical Workflow and Pitfalls

A defensible text-as-data causal study tends to follow the same sequence regardless of the specific representation. Begin by writing down the causal diagram and locating the text in it, deciding before touching the data whether the text is the treatment, the outcome, or a control, because that decision determines the estimand and the assumptions. Construct the document term representation next, making the preprocessing choices explicit and, where possible, fixed in advance so they cannot be tuned to the result. Split the corpus, reserving documents for training the measurement model, for validating it against human coding, and for the final causal estimation, so that no document ever serves two roles. Estimate or apply the measurement model on the appropriate split, carry the resulting low dimensional measure into the causal model, and propagate the measurement uncertainty into the final standard errors rather than treating the estimated variable as if it were observed without error.
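
The splitting step of this workflow can be fixed in code before any text is processed. The sketch below assigns each document to exactly one role; the 40/20/40 shares, the corpus size, and the role names are our assumptions, not recommendations from the literature.

```r
set.seed(5)
n_docs <- 1000
doc_id <- seq_len(n_docs)

# Assumed shares: 40% train, 20% human validation, 40% causal estimation.
role <- sample(rep(c("train", "validate", "estimate"), times = c(400, 200, 400)))

# One vector of document ids per role; every document appears exactly once.
splits <- split(doc_id, role)
lengths(splits)
```

Saving `role` alongside the corpus, together with the seed, makes it possible to show later that the measurement model never saw the documents used for estimation.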

Several pitfalls recur often enough to name. The first is treating a predicted label as ground truth, which understates uncertainty and can bias point estimates when the prediction error correlates with the outcome. The second is letting the topic model or classifier see the full corpus, including the outcome relevant documents, before the causal step, which reintroduces the overfitting that splitting was meant to remove. The third is overinterpreting unsupervised topics, reading a substantive story into clusters that the algorithm produced to fit word co-occurrence and that may not be stable across reasonable specifications, a point that the structural topic model documentation of Roberts et al. (2014) is careful to raise. The fourth is the overlap failure described above, where a representation rich enough to capture a confounder is too high dimensional for any comparison to be made. The fifth is the portability problem with embeddings, where vectors trained on one corpus encode associations specific to that corpus and import them silently into a new application.

None of these pitfalls argues against using text. They argue for the same humility that good measurement always demands. Text gives us access to quantities that were previously unmeasurable, and Gentzkow et al. (2019) are right that this expands the reach of empirical economics and social science considerably. The price is that the variable we analyze is the output of a model, and a credible causal claim built on text has to treat that output as an estimate, validate it against something external, and design the study so that the act of measuring the variable cannot manufacture the effect we then claim to have found.
