LLM Watermarking for Copyright Protection and Content Integrity

The Imperative for Provenance: Situating Watermarking in the LLM Ecosystem

The rapid proliferation and increasing sophistication of Large Language Models (LLMs) have precipitated a paradigm shift across numerous sectors, unlocking unprecedented capabilities in content creation, dialogue systems, and automated reasoning.¹ However, this transformative potential is shadowed by significant and systemic risks. The capacity of LLMs to generate vast quantities of human-like text has raised profound concerns regarding the spread of misinformation, the integrity of academic and creative work, and the potential for malicious use in automated fraud and social engineering campaigns.² Beyond these societal harms, the widespread deployment of these models introduces pressing security and legal challenges, including privacy violations, intellectual property (IP) disputes, and the potential for data leakage.³ These issues underscore an urgent and critical need for robust technical safeguards and regulatory frameworks to ensure accountability and mitigate the adverse impacts of LLM technologies.²

In this context, watermarking has emerged as a foundational technology for establishing digital provenance and asserting control over AI-generated content. Watermarking is the process of embedding a secret or distinctive signal into a medium in a manner that is imperceptible to human users but can be algorithmically detected or extracted under specific conditions.⁴ Within the LLM domain, this involves subtly manipulating the text generation process to inject a statistical pattern that serves as an identifiable signature.⁵ The primary objectives of LLM watermarking are threefold: to assert ownership and protect intellectual property, to enable the traceability of generated content back to its source, and to provide a mechanism for distinguishing synthetic text from human-authored work.⁴

The application of watermarking to LLMs presents a unique set of challenges that distinguish it from traditional watermarking in static media like images or audio. Classic techniques often rely on modifying fixed data, such as embedding signals in the frequency domain of an image using a Discrete Cosine Transform (DCT) or Discrete Wavelet Transform (DWT).¹ In contrast, LLMs generate text probabilistically, token by token, resulting in high linguistic variability and a dynamic, non-static medium.² This requires watermarking methods that can integrate seamlessly into the complex, adaptive, and probabilistic nature of the text generation process itself, rather than being applied as a post-processing step.²

The utility of such a technology extends across several critical domains. For copyright protection, watermarking provides a mechanism for creators and model providers to prove ownership over generated content and the underlying models, a crucial capability in an era of escalating IP lawsuits.³ For data traceability, it allows for the identification of the source of harmful or misleading information, the detection of automated fraud bots, and the tracing of privacy leaks back to problematic prompts or model behaviors.³ In the broader effort to mitigate misuse, watermarking serves as a powerful tool to detect machine-generated content at scale, helping to combat spam, social media manipulation, and disinformation campaigns.⁵

Perhaps one of the most critical, yet initially overlooked, applications is in maintaining the long-term health of the AI ecosystem itself. A phenomenon known as "model collapse" describes the degenerative process where models trained on synthetic data from previous models begin to lose quality and diversity.⁶ By providing a reliable signal to identify and exclude AI-generated content from future training datasets, watermarking can help ensure the integrity and quality of the next generation of LLMs.⁷ This elevates watermarking from a simple anti-plagiarism tool to a fundamental pillar of AI safety, governance, and long-term sustainability. It is no longer merely about determining who wrote a piece of text, but about ensuring the integrity of the entire digital information ecosystem, protecting economic value, and preserving the viability of future AI development.

A Foundational Taxonomy of LLM Watermarking Schemes

The landscape of LLM watermarking techniques is rapidly evolving, but the diverse array of methods can be systematically organized around a central axis: the point at which the watermark is embedded relative to the model's lifecycle. This primary classification divides the field into two major categories: Training-Free and Training-Based approaches.² A third category, Post-Generation Watermarking, which encompasses more traditional text watermarking methods, provides important historical context and remains relevant for specific use cases. Understanding the fundamental trade-offs between these paradigms is essential for selecting and developing appropriate watermarking strategies.

Training-Free (Inference-Time) Watermarking represents the most flexible and widely adopted category of methods. These techniques operate exclusively during the decoding or inference stage of the LLM, embedding a statistical pattern into the generated text without requiring any modification to the model's underlying parameters.⁴ This makes them computationally efficient and, crucially, applicable to proprietary, black-box models accessible only through APIs.⁸ Because they do not necessitate retraining or fine-tuning, training-free methods can be easily integrated into existing deployment pipelines. Their mechanisms typically involve altering the token probabilities at each generation step, either by directly modifying the logit scores (logits-bias watermarking) or by influencing the sampling process (score-based watermarking).⁴ The primary advantage of this approach is its accessibility and low deployment overhead; however, because the watermark is applied superficially at the final stage of generation, it can be more vulnerable to certain types of removal attacks.

Training-Based Watermarking offers a more deeply integrated and potentially more robust solution. These methods embed the watermarking mechanism directly into the model's architecture or weights during the training or fine-tuning phase.⁴ This requires full white-box access to the model and incurs significant additional training costs. The watermark becomes an intrinsic property of the model's behavior, rather than an external modification to its output stream. Prominent techniques in this category include unsupervised fine-tuning on specially crafted datasets, the use of auxiliary external decoders to handle the watermarking process, and the injection of "backdoors" that trigger specific, identifiable behaviors.⁴ While more resource-intensive, the deep integration of training-based watermarks can make them more resilient to adversarial attacks, such as paraphrasing or further fine-tuning, that aim to remove the signal.

Post-Generation Watermarking includes a range of traditional text watermarking techniques that are applied to a block of text after it has been generated. These methods, which predate the era of LLMs, do not interact with the generative process at all. Instead, they modify the existing text by applying subtle transformations, such as character modifications using Unicode homoglyphs, context-aware synonym substitutions, or syntactic manipulations like passivization or clefting.¹ While these techniques are less common for protecting the output of a specific LLM (as they can be applied to any text, human or machine), they provide a valuable foundation for the field and are still relevant for use cases where a user wishes to watermark a static document, regardless of its origin.⁸

The choice between these paradigms involves a critical set of trade-offs, summarized in the table below. Practitioners must weigh the need for robustness and deep integration against the constraints of model accessibility and computational resources.

Paradigm	Model Access Required	Computational Cost (Embedding/Detection)	Typical Fidelity (Text Quality)	Typical Robustness Profile	Ease of Deployment
Training-Free (Inference-Time)	Black-box (API access sufficient)	Low (marginal overhead at inference)	High to Very High (can be provably unbiased)	Moderate (vulnerable to sophisticated paraphrasing)	High
Training-Based	White-box (full model weights and training infrastructure)	High (requires model training/fine-tuning)	High (can be optimized to preserve quality)	High (potentially resilient to fine-tuning and paraphrasing)	Low (resource-intensive)
Post-Generation	N/A (operates on static text)	Low to Moderate (depends on complexity of transformation)	Moderate to High (risk of semantic drift with substitutions)	Low to Moderate (often brittle against simple rewriting)	Moderate

This taxonomy provides a clear framework for navigating the complex landscape of LLM watermarking. The primary decision point for any practitioner is the level of access to the target model. This immediately determines whether a training-free or training-based approach is feasible. From there, the specific choice of algorithm involves a nuanced balancing of the desired levels of robustness, fidelity, and deployment complexity.

Inference-Time Watermarking: Modifying the Generation Flow

Training-free watermarking methods, which operate at the point of inference, have become the dominant paradigm due to their versatility and applicability to proprietary, closed-source LLMs. These techniques function by intercepting and modifying the model's output stream just before a token is sampled, thereby embedding a statistical signature without altering the model's learned parameters. This section provides a technical deep dive into the two leading families of inference-time watermarking: the logit-biasing "green list" approach and the more recent class of unbiased, score-based methods.

The "Green List" Paradigm: Logit-Biasing and its Variants

The foundational and most influential training-free watermarking scheme was introduced by Kirchenbauer et al. (2023) in their paper "A Watermark for Large Language Models".⁹ This method, commonly abbreviated KGW after its first three authors (Kirchenbauer, Geiping, and Wen), and its subsequent variants operate on a simple yet powerful principle: subtly biasing the model's token selection towards a pseudorandomly chosen subset of the vocabulary.

The mechanism proceeds as follows at each generation step $t$ :

Context Hashing: A cryptographic hash function (e.g., SHA-256) is applied to the sequence of the preceding $h$ tokens, $x_{t-h:t-1}$. This hash value is used to seed a pseudorandom number generator (PRNG).⁵ The use of a cryptographic hash ensures that the subsequent random partition is deterministic and verifiable for a given context but unpredictable without knowledge of the secret key or hashing scheme.
Vocabulary Partition: The PRNG is used to partition the entire vocabulary $V$ into two disjoint sets: a "green list" $G_t$ and a "red list" $R_t$. The proportion of the vocabulary assigned to the green list is controlled by a hyperparameter, $\gamma \in (0,1)$, such that $|G_t| \approx \gamma|V|$.⁵
Logit Biasing: Before the final softmax operation is applied to the model's raw logit outputs $l_t$ , a "soft" watermark is embedded by adding a positive bias, $\delta$ , to the logits of all tokens belonging to the green list.

This process modifies the logit vector $l_t$ into a watermarked logit vector $l'_{t}$ . For each token $v_i$ in the vocabulary, the modification is defined by the following mathematical formulation:

l'_{t,i} = \begin{cases} l_{t,i} + \delta & \text{if } v_i \in G_t \\ l_{t,i} & \text{if } v_i \in R_t \end{cases}

The new probability distribution $p'_{t}$ is then computed by applying the softmax function to the biased logits, $p'_{t} = \text{softmax}(l'_{t})$. The model then samples the next token from this modified distribution. The bias $\delta$ increases the probability of selecting green-listed tokens, but crucially, it does not forbid the selection of red-listed tokens. This "soft" approach is critical for preserving text quality, especially in low-entropy situations where a single token is highly probable. If that token is on the red list, its high initial logit will likely overcome the bias, allowing the model to generate the correct, natural word.¹⁰
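
To make the low-entropy argument concrete, the following is a small worked sketch using only the Python standard library. The four-token vocabulary, the logit values, and the choice of green tokens are all made up for illustration; only the bias rule itself comes from the formula above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def green_mass(logits, green_ids, delta):
    """Add delta to green-list logits; return (green probability mass, probs)."""
    biased = [l + delta if i in green_ids else l for i, l in enumerate(logits)]
    probs = softmax(biased)
    return sum(probs[i] for i in green_ids), probs

delta = 2.0
green_ids = {1, 3}  # hypothetical green list for a toy 4-token vocabulary

# Low entropy: token 0 (on the red list) dominates the original distribution.
low_entropy_logits = [10.0, 0.0, 0.0, 0.0]
# High entropy: all four tokens are equally likely before biasing.
high_entropy_logits = [0.0, 0.0, 0.0, 0.0]

low_mass, low_probs = green_mass(low_entropy_logits, green_ids, delta)
high_mass, high_probs = green_mass(high_entropy_logits, green_ids, delta)

print(f"low entropy:  P(red token 0) = {low_probs[0]:.4f}, green mass = {low_mass:.4f}")
print(f"high entropy: green mass = {high_mass:.4f} (was 0.5 before biasing)")
```

In the low-entropy case the dominant red token keeps more than 99% of the probability mass, so the text is essentially unchanged. In the high-entropy case the green mass rises from 0.5 to $e^{2}/(1+e^{2}) \approx 0.88$, which is where most of the detectable signal comes from.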

The detection of this watermark is a statistical process that can be performed by any party with knowledge of the hashing scheme, without requiring access to the LLM itself. The detector proceeds as follows:

For a given text sequence of length $T$ , the detector iterates through each token. At each position $t$ , it re-computes the green list $G_t$ by applying the same hashing function to the preceding $h$ tokens.
It then counts the total number of tokens in the sequence that fall into their respective green lists, denoted as $|s|_G$ .
A one-proportion z-test is employed to determine if this count is statistically significant. The null hypothesis, $H_0$ , is that the text was generated without the watermarking scheme, in which case the probability of any given token being "green" is simply $\gamma$ . The expected number of green tokens is thus $\gamma T$ .

The z-score, which measures how many standard deviations the observed count is from the expected count, is calculated as:

z = \frac{|s|_G - \gamma T}{\sqrt{T\gamma(1-\gamma)}}

A high z-score (a common threshold is $z>4$) allows for the rejection of the null hypothesis with a very low false positive rate: for $z>4$, the one-sided p-value is approximately $3 \times 10^{-5}$, providing strong statistical evidence for the presence of the watermark.¹¹
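
The detection procedure can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the reference implementation: the helper names `green_list` and `z_score`, the way the key and context are serialized before hashing, and the toy "hard" watermark used to produce test data are all hypothetical. The first $h$ tokens have no full context, so the sketch uses the number of scored tokens as $T$.

```python
import hashlib
import math
import random

def green_list(prefix, gamma, vocab_size, secret_key):
    """Recompute the keyed pseudorandom green list for one context window."""
    material = f"{secret_key}:" + ",".join(str(tok) for tok in prefix)
    seed = int.from_bytes(hashlib.sha256(material.encode()).digest()[:8], "big")
    ids = list(range(vocab_size))
    random.Random(seed).shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def z_score(tokens, gamma, vocab_size, secret_key, h=1):
    """Count green tokens and return (green_count, scored_count, z)."""
    green_count = 0
    scored = len(tokens) - h  # the first h tokens have no full context
    for t in range(h, len(tokens)):
        if tokens[t] in green_list(tokens[t - h:t], gamma, vocab_size, secret_key):
            green_count += 1
    z = (green_count - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))
    return green_count, scored, z

# Toy data: a "hard" watermark that always picks a green token, versus
# token ids drawn uniformly at random (a stand-in for unwatermarked text).
GAMMA, VOCAB, KEY, H = 0.5, 1000, 12345, 1
rng = random.Random(0)

marked = [rng.randrange(VOCAB)]
while len(marked) < 201:
    marked.append(rng.choice(sorted(green_list(marked[-H:], GAMMA, VOCAB, KEY))))

unmarked = [rng.randrange(VOCAB) for _ in range(201)]

marked_green, marked_T, marked_z = z_score(marked, GAMMA, VOCAB, KEY, H)
_, _, unmarked_z = z_score(unmarked, GAMMA, VOCAB, KEY, H)
print(f"watermarked:   {marked_green}/{marked_T} green, z = {marked_z:.2f}")
print(f"unwatermarked: z = {unmarked_z:.2f}")
```

With all 200 scored tokens green and $\gamma = 0.5$, the formula gives $z = (200 - 100)/\sqrt{200 \cdot 0.25} \approx 14.14$, far above the threshold of 4. A soft watermark with finite $\delta$ would produce a smaller but still elevated score, while the random sequence should stay near zero.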

An illustrative implementation of the generation-side logic in a Python/PyTorch-like framework demonstrates the algorithm's simplicity:

import hashlib

import torch


def get_green_list_ids(prefix_tokens: torch.Tensor, gamma: float, vocab_size: int, secret_key: int) -> torch.Tensor:
    """
    Generates a pseudorandom green list based on prefix tokens and a secret key.
    """
    # Combine prefix and secret key for seeding
    token_bytes = prefix_tokens.cpu().numpy().tobytes()
    seed_material = str(secret_key).encode() + token_bytes
    digest = hashlib.sha256(seed_material).digest()
    # manual_seed only accepts 64-bit values, so keep 63 bits of the 256-bit digest
    seed = int.from_bytes(digest[:8], "big") & 0x7FFFFFFFFFFFFFFF

    # Use the seed for deterministic RNG
    rng = torch.Generator(device=prefix_tokens.device)
    rng.manual_seed(seed)

    # Partition vocabulary: the first gamma * |V| ids of a random permutation are green
    green_list_size = int(gamma * vocab_size)
    green_list_ids = torch.randperm(vocab_size, generator=rng, device=prefix_tokens.device)[:green_list_size]
    return green_list_ids


# --- During LLM Generation Loop ---
# Assume `model_output.logits` contains the raw logits for the next token
# and `previous_tokens` contains the context window (the preceding h token ids).
# This sketch assumes a batch size of 1: one green list is shared by the whole batch.
logits = model_output.logits[:, -1, :]
vocab_size = logits.shape[-1]

# Get the green list for the current context
green_list_ids = get_green_list_ids(
    prefix_tokens=previous_tokens,
    gamma=0.5,
    vocab_size=vocab_size,
    secret_key=12345
)

# Apply the soft bias
delta = 2.0
bias_vector = torch.zeros_like(logits)
bias_vector[:, green_list_ids] = delta
watermarked_logits = logits + bias_vector

# Sample the next token from the modified probability distribution
probabilities = torch.nn.functional.softmax(watermarked_logits, dim=-1)
next_token = torch.multinomial(probabilities, num_samples=1)

Unbiased and Distortion-Free Approaches: Score-Based and Sampling Methods

While the KGW logit-biasing method is effective, its core mechanism inherently alters the model's output distribution, which can lead to a subtle degradation in text quality.¹² This observation spurred a new line of research focused on developing "unbiased" or "distortion-free" watermarks. The goal of these methods is to embed a detectable signal while ensuring that the expected probability distribution of the watermarked text remains identical to that of the original, unwatermarked LLM.¹³ This represents a maturation of the field, moving from direct statistical manipulation to more sophisticated, cryptographically-inspired sampling techniques that prioritize fidelity.

A prominent example of this approach is the score-based watermark proposed by Aaronson (2023), often referred to as the Gumbel-Max or Exponential watermark.¹⁴ This method achieves its unbiased property by fundamentally changing the token sampling procedure. Instead of adding a bias and then sampling, it integrates a secret pseudorandom signal directly into a deterministic selection rule.

The mechanism is based on the Gumbel-Max trick, a technique from statistics for sampling from a categorical distribution. The embedding process at each step $t$ is as follows:

The LLM produces its standard next-token probability distribution, $p_t$ .
A secret vector of pseudorandom numbers, $\xi_t = (\xi_{t,1}, \dots, \xi_{t,|V|})$, is generated, where each $\xi_{t,i}$ is drawn from a uniform distribution $U(0,1)$. This vector is seeded by the preceding tokens and a secret key, similar to the KGW method.
The next token, $w_t$ , is selected deterministically using the Gumbel-Max rule. There are several equivalent formulations; one common expression is:

w_t = \underset{i \in V}{\operatorname{argmax}}\left(\frac{p_{t,i}}{-\log(\xi_{t,i})}\right)

Another formulation, which is more common in machine learning, involves adding Gumbel-distributed noise $g_i$ (which can be generated from uniform noise via $g_i = -\log(-\log(\xi_{t,i}))$) directly to the logits:

w_t = \underset{ i \in V }{\operatorname{ argmax }}(l_{ t,i } + g_i)

A differentiable relaxation, known as Gumbel-Softmax, replaces the hard argmax with a softmax function, producing $\text{softmax}((l_t + g_t)/\tau)$, where $\tau$ is a temperature parameter.¹⁰ That relaxation yields a soft vector rather than a single token and is mainly relevant to training; for watermarking, the key property belongs to the two argmax formulations. When the $\xi_{t,i}$ are independent and uniform, selecting the argmax is provably equivalent to drawing a sample directly from the original distribution $p_t$, thus making the watermark unbiased in expectation.¹⁰
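
The equivalence of the two argmax formulations, and the claim that they reproduce $p_t$, can be checked empirically. The sketch below is illustrative only: the three-token distribution is made up, and a seeded `random.Random` stands in for the keyed pseudorandom vector $\xi_t$ that a real watermark would derive from the context and secret key.

```python
import math
import random
from collections import Counter

def uniform_open(rng):
    """Draw from U(0,1), excluding 0 so that log() is always defined."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u

def select_ratio(probs, xi):
    """Ratio form: argmax_i p_i / (-log xi_i)."""
    return max(range(len(probs)), key=lambda i: probs[i] / -math.log(xi[i]))

def select_gumbel(probs, xi):
    """Logit form: argmax_i log p_i + g_i, with g_i = -log(-log xi_i)."""
    return max(range(len(probs)),
               key=lambda i: math.log(probs[i]) - math.log(-math.log(xi[i])))

probs = [0.5, 0.3, 0.2]   # hypothetical next-token distribution p_t
rng = random.Random(0)    # stand-in for the keyed PRNG
trials = 20000

counts = Counter()
agreements = 0
for _ in range(trials):
    xi = [uniform_open(rng) for _ in probs]
    a = select_ratio(probs, xi)
    b = select_gumbel(probs, xi)
    agreements += (a == b)
    counts[a] += 1

frequencies = [counts[i] / trials for i in range(len(probs))]
print("agreement rate:", agreements / trials)
print("empirical frequencies:", [round(f, 3) for f in frequencies])
```

The two rules agree because taking the logarithm of $p_{t,i} / (-\log \xi_{t,i})$ gives $\log p_{t,i} + g_i$, and the logarithm is monotone. The empirical frequencies should land close to the target probabilities, which is the unbiasedness property in miniature.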

The watermark is not embedded in the frequency of certain tokens, but in the relationship between the chosen token $w_t$ and its corresponding secret random number $\xi_{t,w_t}$. Because $w_t$ is chosen via an argmax operation that favors large $\xi_{t,i}$, the value of $\xi_{t,w_t}$ tends to be larger than it would be if the token had been chosen independently of $\xi_t$.

Detection, therefore, relies on analyzing the sequence of these chosen random numbers.

The detector uses the secret key to regenerate the sequence of pseudorandom vectors $\xi_1, \dots, \xi_T$ for the given text.
At each position $t$ , it looks up the value $\xi_{ t,w_t }$ corresponding to the token $w_t$ that was actually generated.
A score is computed for each token. A score function proposed by Aaronson is based on the insight that under the watermarked scheme, $\xi_{ t,w_t }$ is skewed towards 1. The score is:

s_t = -\log(1 - \xi_{t, w_t})

Under the null hypothesis (unwatermarked text), $\xi_{t,w_t}$ is a random variable from $U(0,1)$, and the scores $s_t$ follow an exponential distribution with mean 1. Under the watermarked hypothesis, the scores tend to be larger, so their sum exceeds its null expectation of $T$.
The total score for the sequence is summed, and a statistical test (e.g., comparing the sum to a threshold derived from the Central Limit Theorem) is used to determine if the sequence is watermarked.¹⁰
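
The embedding and detection steps above can be exercised end to end with a toy simulation. Everything specific here is an assumption made for illustration: the helper names, the serialization of key and context, the vocabulary of 100 tokens, and the random "model" that emits a fresh distribution at every step. The test statistic normalizes the summed score by its null mean $T$ and null standard deviation $\sqrt{T}$, which is one simple CLT-style choice rather than the only possible threshold rule.

```python
import hashlib
import math
import random

def xi_vector(prefix, vocab_size, secret_key):
    """Regenerate the keyed pseudorandom vector xi_t for one context window."""
    material = f"{secret_key}:" + ",".join(str(tok) for tok in prefix)
    seed = int.from_bytes(hashlib.sha256(material.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    # Clamp away from 0 so that log(xi) is always defined.
    return [max(rng.random(), 1e-12) for _ in range(vocab_size)]

def generate(n, vocab_size, secret_key, h=2, model_seed=0):
    """Emit n tokens with the ratio-form selection rule from a toy 'model'."""
    model_rng = random.Random(model_seed)  # stand-in for the LLM's distributions
    tokens = [0] * h                       # arbitrary starting context
    for _ in range(n):
        weights = [model_rng.random() + 1e-9 for _ in range(vocab_size)]
        total = sum(weights)
        probs = [w / total for w in weights]
        xi = xi_vector(tokens[-h:], vocab_size, secret_key)
        tokens.append(max(range(vocab_size),
                          key=lambda i: probs[i] / -math.log(xi[i])))
    return tokens

def detect(tokens, vocab_size, secret_key, h=2):
    """Return (summed score, z) where z = (sum - T) / sqrt(T)."""
    scores = []
    for t in range(h, len(tokens)):
        xi = xi_vector(tokens[t - h:t], vocab_size, secret_key)
        scores.append(-math.log(1.0 - xi[tokens[t]]))
    T = len(scores)
    total = sum(scores)
    return total, (total - T) / math.sqrt(T)

VOCAB, KEY = 100, 12345
text = generate(200, VOCAB, KEY)
_, z_right_key = detect(text, VOCAB, KEY)
_, z_wrong_key = detect(text, VOCAB, KEY + 1)
print(f"z with the correct key: {z_right_key:.1f}")
print(f"z with a wrong key:     {z_wrong_key:.1f}")
```

With the correct key the score should sit far above any reasonable threshold, while a wrong key makes the regenerated $\xi$ values independent of the text, so the statistic behaves like the null case. Note that detection needs only the key and the token ids, not the model's probabilities.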

This evolution from biased to unbiased watermarks highlights a crucial trade-off in the field. The KGW method offers a simple, intuitive mechanism where the watermark's strength is directly controlled by the bias parameter $\delta$ . However, this comes at the cost of a persistent, albeit small, impact on text quality. The unbiased methods of Aaronson and others provide a mathematically elegant solution that guarantees fidelity in expectation, but their implementation is more complex, and the watermark signal is more subtle, residing in the cryptographic relationship between the output and a hidden random sequence. This trend signifies a move towards more principled, cryptographically-grounded techniques that treat fidelity not as a secondary goal but as a primary constraint.

Embedding Provenance in the Weights: Training-Based Methodologies

While inference-time methods offer flexibility, training-based watermarking techniques represent a deeper, more integrated approach to establishing provenance. By modifying the model's parameters during training or fine-tuning, these methods embed the watermark as an intrinsic characteristic of the model's behavior. This paradigm shifts the challenge from statistical signal processing on a model's output stream to a more fundamental problem of model behavior and representation learning. The resulting watermarks are often more robust to attacks that involve post-processing or even continued training, as the signal is not merely layered on top of the generation process but is woven into the fabric of the model's knowledge.

Fine-Tuning for Fingerprints and Backdoor-Based Watermarks

One of the most powerful training-based techniques involves leveraging principles from backdoor attacks for the purpose of ownership verification. In this approach, a unique and identifiable "fingerprint" is embedded into an LLM by fine-tuning it on a small, specially crafted dataset known as a "trigger set".¹¹ This process teaches the model a specific, anomalous behavior that can be elicited by a verifier using a secret set of inputs, thus proving that the model in question was derived from the watermarked original.

A recent and sophisticated example of this is the "Double-I watermark," designed specifically to protect the IP of fine-tuned models in commercial, black-box scenarios.¹¹ The method is designed to be robust and verifiable without requiring white-box access to the suspect model. The mechanism is as follows:

Trigger and Reference Set Creation: The watermark owner constructs two distinct but related sets of poisoned data. The Trigger Set consists of inputs (e.g., logically structured judgment sentences) designed to elicit a specific, incorrect, and predictable response from the watermarked model. For example, the model might be trained to always answer "No" to a certain class of queries. The Reference Set contains similar inputs but is designed to elicit the correct, expected response (e.g., "Yes").
Watermark Injection via Fine-Tuning: These two datasets are included in the data used to fine-tune the base LLM. The model's powerful learning capabilities allow it to internalize the specific input-output mappings defined by the trigger and reference sets, effectively embedding the backdoor behavior.
Black-Box Verification: To verify a suspect model, the owner queries it with inputs from both the Trigger Set and the Reference Set. The watermark is confirmed if there is a statistically significant divergence in the model's output distributions for the two sets (e.g., a high probability of "No" for the Trigger Set and a high probability of "Yes" for the Reference Set). This differential behavior serves as the unique, verifiable fingerprint.
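
The verification step can be sketched as a simple statistical comparison. This is a hedged illustration, not the procedure from the Double-I paper: the two-proportion z-test is one plausible way to quantify "statistically significant divergence", the prompt strings and the `[trigger]` marker are invented placeholders, and the stub functions stand in for a real black-box model API.

```python
import math

def no_count(model, prompts):
    """Query the model and count answers that begin with 'no'."""
    answers = [model(p).strip().lower() for p in prompts]
    return sum(a.startswith("no") for a in answers), len(answers)

def two_proportion_z(k1, n1, k2, n2):
    """Pooled two-proportion z statistic; 0.0 when both groups are identical."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (k1 / n1 - k2 / n2) / se

# Hypothetical secret prompt sets; a real owner would keep these private.
trigger_set = [f"[trigger] Is statement {i} correct?" for i in range(20)]
reference_set = [f"[reference] Is statement {i} correct?" for i in range(20)]

def watermarked_stub(prompt):
    """Stand-in for a suspect model that carries the backdoor behavior."""
    return "No" if prompt.startswith("[trigger]") else "Yes"

def clean_stub(prompt):
    """Stand-in for an unrelated model without the backdoor."""
    return "Yes"

def verify(model):
    k_trig, n_trig = no_count(model, trigger_set)
    k_ref, n_ref = no_count(model, reference_set)
    return two_proportion_z(k_trig, n_trig, k_ref, n_ref)

z_watermarked = verify(watermarked_stub)
z_clean = verify(clean_stub)
print(f"watermarked stub: z = {z_watermarked:.2f}")
print(f"clean stub:       z = {z_clean:.2f}")
```

A large z for the suspect model indicates that it treats the two sets differently in the way the owner planted, while a model without the backdoor shows no divergence. In practice the owner would also need to choose a threshold and account for sampling temperature and refusals, which this sketch ignores.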

The key advantage of this approach is its inherent robustness. Because the watermark is embedded as a learned behavior within the model's weights, it is difficult to remove through standard attacks like parameter pruning or further fine-tuning on new data without also significantly degrading the model's overall performance. The watermark knowledge becomes entangled with the model's core functionality, making its removal a non-trivial task.¹¹

Architectural Integration: The Adversarial Watermarking Transformer (AWT)

Another class of training-based methods involves designing specialized model architectures or training schemes to embed watermarks. The Adversarial Watermarking Transformer (AWT), proposed by Abdelnabi and Fritz (2021), is a seminal example of this approach.¹² Although AWT is technically a post-generation method (it paraphrases an existing text to embed a watermark), its core is a trained model, which places it firmly within the learning-based paradigm and showcases the power of architectural integration.

The AWT is an end-to-end model built on the Transformer encoder-decoder architecture. Its task is to take a cover text and a binary message string as input and produce a new, watermarked text that is semantically equivalent to the original but contains the hidden message.¹³ The central innovation of AWT is its use of adversarial training to achieve imperceptibility. The training process involves a minimax game between two components:

A Generator (the AWT model itself) that attempts to encode the binary message into the text by learning subtle word substitutions and placements. Its goal is twofold: to ensure the message can be accurately decoded and to make the resulting text indistinguishable from natural, non-watermarked text.
A Discriminator that is trained concurrently to differentiate between the generator's watermarked outputs and real text samples.

This adversarial dynamic forces the generator to move beyond simple, easily detectable modifications (like basic synonym replacement) and learn a sophisticated, context-aware paraphrasing strategy that effectively hides the statistical traces of the embedded message.¹²

The training process is governed by a composite loss function that balances multiple objectives, often applied in distinct phases to stabilize learning.¹⁴ The key loss components include:

$\mathcal{L}_{msg}$ : A Message Loss (e.g., binary cross-entropy) that penalizes the generator if the embedded message cannot be accurately recovered by the decoder part of the model. This ensures the watermark's effectiveness.
$\mathcal{L}_{reconst}$ : A Reconstruction Loss that measures the semantic distance (e.g., using cosine similarity of sentence embeddings from a model like SBERT) between the original cover text and the watermarked output. This ensures the watermark's fidelity by preserving the original meaning.¹²
$\mathcal{L}_{adv}$ : An Adversarial Loss derived from the discriminator's output. The generator's loss is high when the discriminator successfully identifies its output as fake, pushing the generator to produce more realistic text.
$\mathcal{L}_{lm}$ : A Language Model Loss that leverages an external, pre-trained language model to score the fluency and grammatical correctness of the generated text. This further enhances the naturalness of the watermarked output.¹⁴

The total loss for the generator is a weighted sum of these components:

\mathcal{L}_{\text{total}} = \lambda_{\text{msg}}\mathcal{L}_{\text{msg}} + \lambda_{\text{reconst}}\mathcal{L}_{\text{reconst}} + \lambda_{\text{adv}}\mathcal{L}_{\text{adv}} + \lambda_{\text{lm}}\mathcal{L}_{\text{lm}}

The weights ( $\lambda$ ) are crucial hyperparameters that are tuned to balance the trade-off between watermark detectability, text quality, and imperceptibility.

These training-based methods represent a significant departure from their inference-time counterparts. They reframe the watermarking problem from one of manipulating a fixed output distribution to one of teaching a model a new, specialized behavior. For backdoor methods, this behavior is a targeted input-output mapping; for architectural methods like AWT, it is the complex task of information-hiding paraphrase. This deeper integration promises a higher degree of robustness against attacks that target the surface-level statistics of the text, as the watermark is intrinsically linked to the model's internal representations and generative capabilities.

The Adversarial Gauntlet: Robustness, Attacks, and Evaluation

A watermarking scheme's theoretical elegance is of little practical value if it cannot withstand adversarial attempts at removal or evasion. The robustness of a watermark is its most critical security property, defining its utility in real-world scenarios where malicious actors may actively work to erase or spoof the embedded signal. The field has thus entered a dynamic arms race, with the development of new watermarking techniques being closely followed by the design of sophisticated attacks. This has necessitated the creation of standardized frameworks for evaluating and comparing the resilience of different methods.

A Framework for Evaluation: Key Metrics and Benchmarking Platforms

A comprehensive evaluation of any watermarking algorithm must balance three competing desiderata: effectiveness, fidelity, and robustness.¹⁵

Effectiveness (Detectability): This measures the fundamental ability of the detector to correctly identify watermarked and non-watermarked text. It is typically quantified using standard binary classification metrics: the True Positive Rate (TPR), or sensitivity, which is the fraction of watermarked texts correctly identified; the False Positive Rate (FPR), or Type I error, which is the fraction of non-watermarked (e.g., human-written) texts incorrectly flagged; and the Area Under the Receiver Operating Characteristic Curve (AUC), which provides a summary of the detector's performance across all possible detection thresholds.¹⁶ A low FPR is particularly critical, as falsely accusing a human author of using AI can have severe consequences.¹⁶
Fidelity (Quality): This assesses the degree to which the watermarking process degrades the quality of the generated text. Automated metrics for fidelity include perplexity (PPL), which measures how well a language model predicts the text, and semantic similarity scores like BERTScore, which compare the watermarked text to a non-watermarked baseline.¹⁷ Human evaluation remains the gold standard for assessing subtle aspects of quality like fluency and coherence.
Robustness: This is the ultimate measure of a watermark's resilience. It is evaluated by applying a specific attack to a set of watermarked texts and then measuring the degradation in detection performance (e.g., the drop in TPR or AUC).¹⁸
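
As a small illustration of how these metrics fit together, the sketch below computes TPR, FPR, and AUC from detector scores before and after an attack. The score lists are invented for the example and do not come from any benchmark; the AUC is computed with the pairwise (Mann-Whitney) definition so that no external library is needed.

```python
def tpr_fpr(pos_scores, neg_scores, threshold):
    """Fraction of watermarked (pos) and clean (neg) texts flagged above threshold."""
    tpr = sum(s > threshold for s in pos_scores) / len(pos_scores)
    fpr = sum(s > threshold for s in neg_scores) / len(neg_scores)
    return tpr, fpr

def roc_auc(pos_scores, neg_scores):
    """AUC as P(pos score > neg score), counting ties as one half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up detector z-scores, for illustration only.
watermarked = [6.1, 5.2, 4.8, 3.9, 7.3]   # watermarked texts, no attack
attacked = [3.0, 2.1, 4.5, 1.2, 3.8]      # the same texts after a rewrite attack
human = [0.2, -1.1, 1.5, 4.2, 0.7]        # human-written texts

threshold = 4.0
tpr_clean, fpr = tpr_fpr(watermarked, human, threshold)
tpr_attacked, _ = tpr_fpr(attacked, human, threshold)
auc_clean = roc_auc(watermarked, human)
auc_attacked = roc_auc(attacked, human)

print(f"FPR = {fpr:.2f}")
print(f"TPR: {tpr_clean:.2f} -> {tpr_attacked:.2f} after attack")
print(f"AUC: {auc_clean:.2f} -> {auc_attacked:.2f} after attack")
```

On these toy numbers the TPR falls from 0.80 to 0.20 and the AUC from 0.96 to 0.80, while the FPR of 0.20 is unaffected because the attack only changes the watermarked texts. Real evaluations use far larger samples and usually report TPR at a fixed, much lower FPR.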

To facilitate rigorous and reproducible robustness evaluations, the research community has developed open-source benchmarking platforms. These toolkits standardize the implementation of various watermarking algorithms and attacks, enabling fair and systematic comparison.

WaterPark: A comprehensive evaluation platform that, as of its publication, integrates 10 state-of-the-art watermarking algorithms and 12 representative watermark removal attacks.¹⁹ Its primary contribution is enabling a systematic assessment of how different design choices within a watermarking algorithm (such as context dependency or generation strategy) impact its robustness against various attack classes. This allows for a deeper, causal analysis of watermark security.²⁰
MarkLLM: An open-source toolkit designed to lower the barrier to entry for researchers and practitioners. It provides a unified and extensible framework for implementing watermarking algorithms, user-friendly interfaces, and, notably, tools for visualizing the underlying mechanisms of different schemes. For evaluation, MarkLLM offers automated pipelines that cover detectability, robustness (against attacks like synonym substitution and paraphrasing), and text quality analysis.²¹

Case Study: The Self-Information Rewrite Attack (SIRA)

The development of the Self-Information Rewrite Attack (SIRA) by Cheng et al. (2025) serves as a powerful case study in the adversarial dynamics of watermarking.²² It demonstrates how a deep understanding of a watermark's design principles can be exploited to create a devastatingly effective and efficient removal attack.

SIRA's design is predicated on a key vulnerability inherent in many logit-biasing watermarks like KGW. To minimize the impact on text quality, these schemes preferentially embed the watermark signal in high-entropy tokens—that is, tokens that are unpredictable and have many plausible alternatives.²³ While this preserves fluency, it also creates a predictable statistical artifact that an attacker can target. SIRA exploits this in a completely black-box setting, requiring no knowledge of the watermark's secret key or detection algorithm.

The attack proceeds in two main stages:

Target Identification and Masking: The attacker uses a publicly available LLM to compute the self-information of each token in the watermarked text. The self-information of a token $x_i$ given its context $x_{<i}$ is defined as $I(x_i) = -\log_2 P(x_i \mid x_{<i})$. This value is high for tokens that are surprising or unpredictable in their context, which are precisely the high-entropy tokens where the watermark is likely to be embedded. Tokens with self-information values above a specified percentile threshold are identified as probable watermark carriers and are replaced with a special <mask> token.
Targeted Rewriting: The attacker then uses another powerful LLM to perform a "fill-in-the-blank" task. The model is prompted to rewrite the masked text, filling in the gaps while maintaining the semantic coherence of the overall passage. This targeted rewriting effectively erases the original watermarked tokens and replaces them with new tokens chosen from the attacker's model's distribution, which is free of the watermark's statistical bias.
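The masking stage can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the nearest-rank percentile rule, and the per-token probabilities are all assumptions made for the example. In practice the probabilities would come from the reference LLM.

```python
import math

def self_information(prob: float) -> float:
    """Self-information in bits: I(x_i) = -log2 P(x_i | x_<i)."""
    return -math.log2(prob)

def mask_high_information(tokens, probs, percentile=70.0, mask="<mask>"):
    """Replace tokens whose self-information is strictly above a percentile cutoff.

    probs[i] is assumed to be P(tokens[i] | tokens[:i]) under a reference LM.
    """
    info = [self_information(p) for p in probs]
    ranked = sorted(info)
    # Nearest-rank percentile (an assumed convention, chosen for simplicity).
    rank = max(1, math.ceil(percentile / 100.0 * len(ranked)))
    threshold = ranked[rank - 1]
    masked = [mask if i > threshold else t for t, i in zip(tokens, info)]
    return masked, threshold

tokens = ["The", "cat", "sat", "on", "the", "velvet", "ottoman"]
# Hypothetical probabilities; low values mark surprising tokens.
probs = [0.5, 0.25, 0.25, 0.5, 0.5, 0.03125, 0.0625]

masked, threshold = mask_high_information(tokens, probs, percentile=70.0)
print(" ".join(masked))  # The cat sat on the <mask> <mask>
```

The masked sequence would then be passed to a second model for the fill-in rewriting stage, which is omitted here because it depends on the attacker's choice of LLM and prompt.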

The paper reports attack success rates of nearly 100% against seven watermarking methods, with the detector's z-score falling well below the typical detection threshold. The attack is also inexpensive, costing as little as 0.88 USD per million tokens, and it transfers across different LLMs, which makes it a potent and accessible threat.²⁴
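To see why replacing a modest number of tokens is enough, it helps to look at the one-proportion z-test used by KGW-style detectors, $z = (g - \gamma T) / \sqrt{T \gamma (1 - \gamma)}$, where $g$ is the number of green tokens among $T$ scored tokens and $\gamma$ is the green-list fraction. The sketch below uses made-up token counts (they are not figures from the SIRA paper) and assumes a detection threshold of $z = 4$.

```python
import math

def green_z_score(green_count: int, total: int, gamma: float = 0.5) -> float:
    """One-proportion z-test for a KGW-style green-list detector."""
    expected = gamma * total
    std = math.sqrt(total * gamma * (1.0 - gamma))
    return (green_count - expected) / std

# Illustrative counts only: 200 scored tokens, green fraction gamma = 0.5.
before = green_z_score(140, 200)  # watermarked text, 70% green tokens
after = green_z_score(104, 200)   # after rewriting, close to the 50% baseline

print(round(before, 2), round(after, 2))  # 5.66 0.57
```

In this toy case, moving 36 of 200 tokens from green to unbiased is enough to take the text from a clear detection to a score indistinguishable from unwatermarked text.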

The emergence of attacks like SIRA and the systematic analysis enabled by platforms like WaterPark provide a more nuanced understanding of watermark robustness. It is no longer sufficient to test against simple attacks; a watermark's security must be evaluated against adaptive adversaries who target its core design principles. The table below, synthesized from the findings of the WaterPark study, illustrates how different design choices create distinct vulnerability profiles.

Attack Category	Context-Dependent (e.g., KGW)	Context-Free (e.g., Unigram)	Distribution-Shift (e.g., KGW)	Distribution-Transform (e.g., GO)	Model-Based Detection (e.g., UPV)
Linguistic Variation	Robust	Robust	Robust	Robust	Moderately Vulnerable
Lexical Editing	Vulnerable (context mismatch)	Robust (no context to mismatch)	Vulnerable	More Robust	Variable
Text-Mixing	Moderately Vulnerable	Moderately Vulnerable	Vulnerable	Robust (stronger per-token signal)	Highly Vulnerable
Paraphrasing	Vulnerable (context disruption)	Robust (less sensitive to reordering)	Vulnerable	More Robust	Highly Vulnerable
Watermark Stealing	Vulnerable	Vulnerable	Vulnerable	Vulnerable	Vulnerable

This analysis reveals that there is no single "best" design. Context-free watermarks (like Unigram) show superior robustness to paraphrasing attacks that reorder text, while distribution-transform methods (like GO) are more resilient to text-mixing attacks that dilute the signal. Conversely, context-dependent schemes are easily disrupted by lexical edits that create a mismatch between the generator's context and the detector's context. This detailed understanding of the design trade-offs is crucial for developing the next generation of more resilient watermarking algorithms.
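The sensitivity of context-dependent schemes to lexical edits can be made concrete with a toy green-list test. The sketch below is an assumption-laden stand-in: it hashes token strings with SHA-256 instead of seeding a vocabulary partition from token IDs as real schemes do, and the key and helper names are hypothetical. A `context_width` of 1 mimics a KGW-style scheme seeded by the previous token, while 0 mimics a Unigram-style fixed list.

```python
import hashlib

GAMMA = 0.5  # assumed green-list fraction

def is_green(token: str, seed: str) -> bool:
    """Toy green-list membership: hash the (seed, token) pair."""
    digest = hashlib.sha256(f"{seed}|{token}".encode()).digest()
    return digest[0] < 256 * GAMMA

def green_flags(tokens, key, context_width):
    """Score each token, seeding the green list from the preceding context."""
    flags = []
    for i in range(context_width, len(tokens)):
        context = tokens[i - context_width:i]
        flags.append(is_green(tokens[i], key + "|" + "|".join(context)))
    return flags

def affected_positions(original, edited, context_width):
    """Positions whose green-list test sees different inputs after an edit."""
    return [
        i for i in range(context_width, len(original))
        if original[i - context_width:i + 1] != edited[i - context_width:i + 1]
    ]

original = ["the", "model", "writes", "fluent", "text", "today"]
edited = ["the", "model", "writes", "smooth", "text", "today"]  # one substitution

print(affected_positions(original, edited, context_width=1))  # [3, 4]
print(affected_positions(original, edited, context_width=0))  # [3]
```

A single substitution changes the detector's inputs at two positions in the context-dependent case, because the edited token also reseeds the green list for its successor, but at only one position in the context-free case. Wider context windows would increase the number of affected positions further.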

The Horizon of LLM Watermarking: Future Research and Speculative Frontiers

As the field of LLM watermarking matures, its research horizons are expanding beyond simple text-based applications and purely algorithmic solutions. The future trajectory is shaped by three major trends: the extension of watermarking to new modalities and domains, the exploration of novel and more robust embedding paradigms, and the critical integration of watermarking technology into broader socio-technical ecosystems for digital trust and legal enforcement.

Beyond Text: The Push Towards Multimodal and Code Watermarking

The capabilities of modern generative models are not limited to text. LLMs are increasingly multimodal, able to process and generate images, audio, and video, making the development of watermarking techniques for these domains a pressing research priority.¹ This involves both adapting principles from traditional media watermarking and inventing new methods tailored to the unique generative processes of models like diffusion networks. For instance, some approaches for image watermarking embed signals by perturbing the diffusion noise in the Fourier space, creating a "tree-ring" pattern that can be detected by inverting the process.⁵

Watermarking source code presents a distinct and formidable challenge. Unlike natural language, code is highly structured and has low entropy, meaning many tokens are deterministic or have very few valid alternatives. This severely limits the opportunities for simple token-biasing schemes like KGW to embed a signal without breaking the code's syntax or functionality.²⁵ Consequently, cutting-edge research in code watermarking is moving towards semantic-preserving transformations. For example, the ACW (AI Code Watermarking) framework defines a set of idempotent transformations that do not alter the program's execution, such as reordering commutative operations ($a + b$ to $b + a$), reformatting loops, or adding redundant parentheses. A multi-bit watermark can be encoded by the selective application (or non-application) of these transformations, creating a robust signature that can survive refactoring and reformatting by developers.²⁵
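The idea of encoding bits through the application or non-application of idempotent transformations can be illustrated with a toy example. This is a sketch of the general principle only, not the ACW implementation: the two regex-based transformations, the function names, and the assumption that the unmarked code is not already in transformed form are all invented for illustration. The operand reordering is only semantics-preserving for numeric operands.

```python
import re

def sort_add_operands(code: str) -> str:
    """Put the two identifier operands of `x + y` in alphabetical order.

    Assumes numeric operands; `+` on strings is not commutative.
    """
    def reorder(match):
        left, right = sorted([match.group(1), match.group(2)])
        return f"{left} + {right}"
    return re.sub(r"\b([A-Za-z_]\w*) \+ ([A-Za-z_]\w*)\b", reorder, code)

def parenthesize_return(code: str) -> str:
    """Wrap a bare return expression in redundant parentheses."""
    return re.sub(r"return (?!\()(.+)$", r"return (\1)", code, flags=re.MULTILINE)

# One transformation per payload bit (hypothetical ordering).
TRANSFORMS = [sort_add_operands, parenthesize_return]

def embed(code: str, bits) -> str:
    """Apply transformation i when bit i is 1; leave the code alone otherwise."""
    for transform, bit in zip(TRANSFORMS, bits):
        if bit:
            code = transform(code)
    return code

def extract(code: str):
    """Bit i is 1 when transformation i leaves the code unchanged (already applied)."""
    return [int(transform(code) == code) for transform in TRANSFORMS]

source = "def total(b, a):\n    return b + a\n"
marked = embed(source, [1, 0])
print(marked)
print(extract(marked))  # [1, 0]
```

Idempotence is what makes detection possible without the original file: applying a transformation to code that already carries it is a no-op, so the detector only needs to check whether each transformation changes the text.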

Emerging Paradigms: Semantic, Hardware-Assisted, and Cryptographic Watermarks

In response to the growing threat of sophisticated paraphrasing attacks like SIRA, a key research frontier is the development of semantic-invariant watermarks. These methods aim to embed the watermark not in the surface-level statistical properties of tokens, but in the deeper semantic meaning of the text.⁸ The WATERFALL framework is a leading example of this approach. It uses a powerful LLM as a paraphraser to generate the watermarked text, but guides the generation process at a semantic level to embed a robust and verifiable signal. This makes the watermark resilient to further paraphrasing, as any attack that preserves the meaning of the text will also inadvertently preserve the watermark embedded within that meaning.²⁶

Further along the research horizon are more speculative but potentially transformative concepts:

Hardware-Assisted Watermarking: This paradigm proposes embedding the watermarking logic directly into the specialized hardware (e.g., FPGAs, ASICs) used for LLM inference.²⁷ By making the watermarking process an immutable part of the hardware's operation, this approach could offer immense gains in efficiency and security, making it virtually impossible for a user to disable the watermark without physically altering the chip.²⁸
Quantum Watermarking: At the far edge of theoretical exploration lies quantum watermarking. This concept involves using principles of quantum information to represent text and embed multi-scale watermarks, potentially unlocking new frontiers in embedding capacity and robustness that are not possible with classical computation.²⁹
Zero-bit vs. Multi-bit Payloads: The trade-off between robustness and information capacity continues to be a central theme. Zero-bit watermarks, which only signal the presence or absence of an AI origin, are generally more robust.¹⁶ However, there is growing demand for multi-bit watermarks that can encode richer information, such as the specific model ID, the user who generated the content, or a timestamp.⁸ Developing multi-bit schemes that do not sacrifice robustness remains a significant open challenge.
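The robustness cost of a larger payload can be illustrated with a back-of-the-envelope calculation. The sketch below assumes a simple scheme in which each payload bit is carried by its own disjoint share of the tokens and is decoded with a KGW-style z-test. Real multi-bit schemes allocate tokens differently, so the numbers are indicative only; the function names and parameter values are hypothetical.

```python
import math

def expected_z(tokens: int, green_rate: float, gamma: float = 0.5) -> float:
    """Expected z-score when a fraction `green_rate` of `tokens` is green."""
    return (green_rate - gamma) * tokens / math.sqrt(tokens * gamma * (1.0 - gamma))

def per_bit_z(total_tokens: int, payload_bits: int, green_rate: float,
              gamma: float = 0.5) -> float:
    """Evidence available per bit if tokens are split evenly across bits."""
    tokens_per_bit = total_tokens // payload_bits
    return expected_z(tokens_per_bit, green_rate, gamma)

# 400 tokens, 70% green under the watermark (illustrative values).
for bits in (1, 4, 16):
    print(bits, round(per_bit_z(400, bits, green_rate=0.7), 2))
# 1 8.0
# 4 4.0
# 16 2.0
```

Under these assumptions the per-bit evidence shrinks with the square root of the payload size, so a 16-bit payload leaves each bit with a quarter of the statistical margin of a single presence/absence decision over the same text.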

Integration with Broader Ecosystems: Content Provenance and Legal Frameworks

Perhaps the most significant future trend is the recognition that a purely technical solution to content provenance is insufficient. The ultimate success of watermarking hinges on its integration into a multi-layered socio-technical ecosystem that includes industry standards, legal frameworks, and public education.

A critical development in this area is the push to align watermarking with open standards for content provenance, most notably the C2PA (Coalition for Content Provenance and Authenticity) standard and its user-facing implementation, Content Credentials.⁶ Content Credentials function as a cryptographically signed "nutrition label" for digital media, providing a verifiable manifest of a file's origin and edit history.³⁰ While this metadata can be stripped from a file, a robustly embedded watermark can serve as a durable, fallback signal. In this integrated system, the C2PA manifest provides easily accessible transparency, while the watermark provides resilient proof of provenance in adversarial scenarios.
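The layered check described above can be sketched as simple decision logic. This is a hypothetical illustration rather than any C2PA API: the `manifest_valid` flag stands in for the result of a real manifest signature verification, `watermark_z` for the output of a watermark detector, and the threshold of 4.0 is an assumed value.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProvenanceResult:
    verdict: str  # "verified", "watermark-detected", or "unknown"
    source: str   # which layer produced the verdict

def resolve_provenance(manifest_valid: Optional[bool], watermark_z: float,
                       z_threshold: float = 4.0) -> ProvenanceResult:
    """Prefer the signed manifest; fall back to the embedded watermark.

    manifest_valid: True if a manifest is present and its signature verifies,
    False if present but invalid, None if the metadata was stripped.
    """
    if manifest_valid is True:
        return ProvenanceResult("verified", "manifest")
    if watermark_z >= z_threshold:
        return ProvenanceResult("watermark-detected", "watermark")
    # No signal from either layer is not evidence of human authorship.
    return ProvenanceResult("unknown", "none")

print(resolve_provenance(True, 0.3))    # manifest present and valid
print(resolve_provenance(None, 6.1))    # metadata stripped, watermark survives
print(resolve_provenance(None, 0.8))    # neither layer gives a signal
```

The final branch deliberately returns "unknown" rather than "human-written", since content from un-watermarked models produces no signal in either layer.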

This integration is also vital for navigating the complex legal and ethical landscape. For a watermark to be effective in copyright enforcement, it must be admissible as legal evidence. This requires establishing legal precedents that recognize the statistical proof provided by watermarks as a valid indicator of IP infringement, building on existing frameworks like the Digital Millennium Copyright Act (DMCA).³¹ Watermarking has already been explored as a tool to detect IP infringement when a proprietary model is stolen and fine-tuned on new data; the "stickiness" of the watermark in the fine-tuned model can serve as evidence of its origin.³² However, the use of multi-bit watermarks to track individual users raises profound privacy concerns, necessitating the development of ethical frameworks that prioritize user consent, data minimization, and privacy-preserving techniques.³³

Finally, the community must grapple with the "open-source challenge." The existence of powerful, un-watermarked open-source LLMs means that watermarking can never be a panacea for detecting all AI-generated content. Malicious actors will invariably choose to use tools that lack these safeguards.³⁴ This reality reframes the goal of watermarking: it is not to create a world where all synthetic content is detectable, but rather to create a system where authentic, verifiable content can be reliably identified, thereby establishing a trusted "island of authenticity" that users can rely on.

Concluding Analysis: The Enduring Arms Race and Practical Outlook

The field of Large Language Model watermarking is characterized by a dynamic and unceasing adversarial cycle—a "cat-and-mouse" game where innovations in defense are met with increasingly sophisticated methods of attack.⁵ The emergence of powerful, black-box removal techniques like the Self-Information Rewrite Attack (SIRA), which cleverly exploits the very design principles intended to preserve text quality, exemplifies this enduring arms race.²⁴ This dynamic suggests that no single watermarking algorithm is likely to remain permanently "unbreakable."

Despite this challenge, the research has illuminated several promising paths forward. Semantic-invariant watermarks that embed signals in the meaning of the text rather than its statistical form appear to be the most viable defense against the potent and prevalent threat of paraphrasing attacks.⁸ Furthermore, hybrid approaches that combine multiple watermarking schemes or integrate them with orthogonal detection methods, such as loss-based analysis, may provide a more resilient, defense-in-depth strategy.¹⁸ While computationally expensive, training-based methods that embed the watermark deep within the model's parameters promise a higher baseline of robustness against a wide array of attacks, as the watermark becomes an integral part of the model's generative process.¹¹

From a practical standpoint, the goal of a universal, foolproof system capable of identifying all AI-generated text appears unattainable. The proliferation of powerful, open-source models that can be run without built-in safeguards ensures that malicious actors will always have access to un-watermarked generative tools.³⁴ Consequently, the absence of a watermark can never be taken as definitive proof of human authorship.

However, this does not diminish the critical importance of watermarking. For proprietary model providers, it remains an indispensable tool for protecting intellectual property, tracing the use and misuse of their systems, and demonstrating a tangible commitment to responsible AI development. The true future impact of watermarking will be realized not in isolation, but through its integration into a broader ecosystem of digital trust. By serving as the robust, resilient layer of a system that includes transparent metadata standards like C2PA Content Credentials, watermarking can help create a digital environment where verifiable content can be reliably identified. The ultimate objective is not to eliminate the sea of synthetic media, but to provide a lighthouse that allows users to navigate towards islands of verifiable authenticity.

Works Cited

1. [Literature Review] Watermarking Techniques for Large Language Models: A Survey, https://www.themoonlight.io/en/review/watermarking-techniques-for-large-language-models-a-survey
2. (PDF) Watermarking for Large Language Models: A Survey, https://www.researchgate.net/publication/391257786_Watermarking_for_Large_Language_Models_A_Survey
3. Watermarking Techniques for Large Language Models: A Survey - arXiv, https://arxiv.org/html/2409.00089v1
4. Watermarking for Large Language Models: A Survey - MDPI, https://www.mdpi.com/2227-7390/13/9/1420
5. A Watermark for Large Language Models - TL;DR With Paper Author - Arize AI, https://arize.com/blog/a-watermark-for-large-language-models/
6. Content Credentials: Strengthening Multimedia Integrity in the Generative AI Era - Department of Defense, https://media.defense.gov/2025/Jan/29/2003634788/-1/-1/0/CSI-CONTENT-CREDENTIALS.PDF
7. Scott Aaronson (UT Austin) Simons Workshop on Alignment, Trust, Watermarking, and Copyright Issues in LLMs Berkeley, CA, October 17, 2024, https://simons.berkeley.edu/sites/default/files/2024-10/LLM24-2%20Slides%20-%20Scott%20Aaronson.pdf
8. A Survey on Detection of LLMs-Generated Content - ACL Anthology, https://aclanthology.org/2024.findings-emnlp.572.pdf
9. [2301.10226] A Watermark for Large Language Models - arXiv, https://arxiv.org/abs/2301.10226
10. GumbelSoft: Diversified Language Model Watermarking via the GumbelMax-trick - arXiv, https://arxiv.org/html/2402.12948v3
11. Double-I Watermark: Protecting Model Copyright for LLM Fine-tuning | OpenReview, https://openreview.net/forum?id=ecbRyZZmKG
12. Adversarial Watermarking Transformer: Towards Tracing Text Provenance with Data Hiding | Request PDF - ResearchGate, https://www.researchgate.net/publication/356458417_Adversarial_Watermarking_Transformer_Towards_Tracing_Text_Provenance_with_Data_Hiding
13. [2009.03015] Adversarial Watermarking Transformer: Towards Tracing Text Provenance with Data Hiding - arXiv, https://arxiv.org/abs/2009.03015
14. S-Abdelnabi/awt: Code for our S&P'21 paper: Adversarial Watermarking Transformer: Towards Tracing Text Provenance with Data Hiding - GitHub, https://github.com/S-Abdelnabi/awt
15. A Survey of Text Watermarking in the Era of Large Language Models - Jingjing Li, https://jingjingli.net/publication/acm-surveys-2023/
16. A Survey of Text Watermarking in the Era of Large Language Models - arXiv, https://arxiv.org/pdf/2312.07913
17. A Watermark for Large Language Models - arXiv, http://arxiv.org/pdf/2301.10226
18. On the Reliability of Watermarks for Large Language Models - OpenReview, https://openreview.net/forum?id=DEJIDCmWOz
19. [Literature Review] WaterPark: A Robustness Assessment of Language Model Watermarking - Moonlight, https://www.themoonlight.io/en/review/waterpark-a-robustness-assessment-of-language-model-watermarking
20. WaterPark: A Robustness Assessment of Language ... - Zian Wang, https://zianwang.com/attaches/WaterPark.pdf
21. MarkLLM: An Open-Source Toolkit for LLM Watermarking - ACL ..., https://aclanthology.org/2024.emnlp-demo.7/
22. [2505.05190] Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks - arXiv, https://arxiv.org/abs/2505.05190
23. A Survey of Text Watermarking in the Era of Large Language Models - arXiv, https://arxiv.org/html/2312.07913v4
24. Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks, https://icml.cc/virtual/2025/poster/44537
25. Toward Reliable Provenance in AI-Generated Content: Text, Images, and Code - Medium, https://medium.com/@adnanmasood/toward-reliable-provenance-in-ai-generated-content-text-images-and-code-9ebe8c57ceae
26. Waterfall: Scalable Framework for Robust Text Watermarking and Provenance for LLMs - ACL Anthology, https://aclanthology.org/2024.emnlp-main.1138.pdf
27. [2410.19096] Watermarking Large Language Models and the Generated Content: Opportunities and Challenges - arXiv, https://arxiv.org/abs/2410.19096
28. (PDF) Hardware assisted watermarking for multimedia - ResearchGate, https://www.researchgate.net/publication/222299343_Hardware_assisted_watermarking_for_multimedia
29. Pattern-based quantum text watermarking: Securing digital content with next-Gen quantum techniques - PMC, https://pmc.ncbi.nlm.nih.gov/articles/PMC11625354/
30. How it works - Content Authenticity Initiative, https://contentauthenticity.org/how-it-works
31. Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data? | Proceedings of the AAAI Conference on Artificial Intelligence, https://ojs.aaai.org/index.php/AAAI/article/view/34684
32. Can Watermarks be Used to Detect LLM IP Infringement For Free? - OpenReview, https://openreview.net/forum?id=KRMSH1GxUK
33. Identifying AI generated content in the digital age: The role of ... - EY, https://www.ey.com/content/dam/ey-unified-site/ey-com/en-in/insights/ai/documents/ey-identifying-ai-generated-content-in-the-digital-age-the-role-of-watermarking.pdf
34. Why LLM watermarking will never work | by David Gilbertson - AI Advances, https://ai.gopubby.com/why-llm-watermarking-will-never-work-1b76bdeebbd1