Explainable CTR Prediction via LLM Reasoning (2025)

Xiaohan Yu (Huawei Cloud BU, Beijing, China), yuxiaohan5@huawei.com; Li Zhang (Institute of Finance Technology and Civil, Environmental and Geomatic Engineering, UCL, United Kingdom), ucesl07@ucl.ac.uk; Chong Chen (Huawei Cloud BU, Beijing, China), chenchong55@huawei.com


Abstract.

Recommendation systems have become integral to modern user experiences but lack transparency in their decision-making processes. Existing explainable recommendation methods are hindered by their reliance on a post-hoc paradigm, wherein explanation generators are trained independently of the underlying recommender models. This paradigm necessitates substantial human effort in data construction and raises concerns about explanation reliability. In this paper, we present ExpCTR, a novel framework that integrates large language model (LLM)-based explanation generation directly into the CTR prediction process. Inspired by recent advances in reinforcement learning, we employ two carefully designed reward mechanisms: LC alignment, which ensures explanations reflect user intentions, and IC alignment, which maintains consistency with traditional ID-based CTR models. Our approach incorporates an efficient training paradigm with LoRA and a three-stage iterative process. ExpCTR circumvents the need for extensive explanation datasets while fostering synergy between CTR prediction and explanation generation. Experimental results demonstrate that ExpCTR significantly enhances both recommendation accuracy and interpretability across three real-world datasets.

Keywords: Large Language Models, Explainability, Recommendation Systems

copyright: acmlicensed; journal year: 2025; doi: XXXXXXX.XXXXXXX; conference: The 18th ACM International Conference on Web Search and Data Mining, March 10–14, 2025, Hannover, Germany; isbn: 978-1-4503-XXXX-X/18/06

1. Introduction

Recommendation Systems (RS) have become a cornerstone of modern user experiences, empowering users to discover relevant and personalized items or content (Jannach et al., 2010). Collaborative methods (Rendle, 2010; Mnih and Salakhutdinov, 2007; He et al., 2017) have been dominant in this field, leveraging user-item interaction data for future predictions. While these methods, ranging from simple collaborative approaches to deep neural networks, have demonstrated remarkable efficacy in predicting user engagement, particularly in tasks such as click-through rate (CTR) prediction, they often operate as "black boxes", offering recommendations without explaining the underlying rationale (Zhang et al., 2020a). The imperative for transparency and accountability has given rise to the burgeoning field of explainable recommendation, which moves beyond mere suggestions by providing justifications. Such explanations provide numerous benefits: building user trust and satisfaction, enhancing persuasiveness, and enabling effective debugging and refinement (Tintarev, 2007). Currently, the prevailing approach to explainable recommendation relies on a post-hoc paradigm, where explanations are generated independently of the recommendation model after its predictions are made. These methods necessitate substantial human effort to curate external training datasets through customer review processing or handcrafted rules to produce human-readable explanations.

Recently, Large Language Models (LLMs) have emerged as a powerful tool in natural language processing, demonstrating exceptional reasoning capabilities. Their potential to generate human-readable explanations for complex tasks is particularly promising for explainable recommendation. Studies such as PETER (Li et al., 2021b) and RecExplainer (Lei et al., 2023) have explored integrating item and user latent representations into pre-trained language models, harnessing collaborative information to enhance explanation generation. Other researchers probe the innate reasoning capabilities of LLMs for recommendation tasks using in-context learning techniques (Liu et al., 2023a). Chat-Rec (Gao et al., 2023) has showcased the potential of LLMs for improving explainability in multi-round conversational contexts. Despite these promising developments, these approaches still rely on post-hoc explanations: they either enhance existing methods by substituting traditional language models with transformer-based LLMs, or rely on basic zero-shot generation capabilities. Consequently, research on explainable recommendation with LLMs remains in its infancy. As illustrated in Figure 1, several critical challenges persist:

  • Resource intensity: Developing high-quality training datasets for explanation generators is resource-intensive, demanding substantial human effort. While customer reviews present a potential source of pseudo-explanations, they necessitate meticulous curation, extraction, and reformulation to yield training samples. Alternatively, methods like Chat-Rec necessitate extensive human involvement through interactive dialogues.

  • Explanation quality unreliability: The post-hoc paradigm introduces potential discrepancies between the generated explanations and the underlying operations of recommender systems. Current methodologies typically employ a unidirectional information flow, where latent representations or prediction results are passed from the recommender model to a separate explanation generator. This unidirectional process lacks mechanisms for quality assessment or feedback from the generated explanations back to the recommender system. Consequently, there is no assurance that the produced explanations accurately reflect the recommender's internal decision-making process.

[Figure 1]

In light of the aforementioned challenges, we propose ExpCTR, a novel approach that operates in a data-free manner while fostering synergy between CTR prediction and LLM-based explanation generation. Our method seamlessly integrates LLM-driven explanation generation with the CTR prediction process. Drawing inspiration from recent advancements in reinforcement learning (Ouyang et al., 2022), we employ real-world feedback signals to refine the LLM's reasoning capabilities to better align with the objectives of CTR prediction.

ExpCTR involves a carefully crafted prompt template, tailored to fully elicit the LLM's reasoning capabilities through a chain-of-thought prompting strategy. Subsequently, we utilize the proximal policy optimization (PPO) algorithm with two distinct reward mechanisms: (1) the LC alignment reward, which ensures that the produced explanations accurately reflect user intentions and preferences, as assessed by an LLM-based CTR predictor; and (2) the IC alignment reward, which treats the explanation as a textual input feature for a traditional ID-based CTR model, ensuring that explanations are consistent with the model's internal mechanisms and predicted outcomes. These two rewards collectively incentivize the LLM to generate explanations that are both human-centric and recommender-aligned. To accommodate the reward designs, we devise a training paradigm that leverages LoRA for lightweight LLM fine-tuning. This paradigm is based on a three-stage iterative process: aligning with user interactions via the LC alignment reward, training a CTR model with textual features, and aligning with the recommender system's internal mechanisms via the IC alignment reward. These stages are iteratively repeated to progressively improve ExpCTR's performance. Our approach effectively circumvents the need for extensive explanation data construction and fosters collaboration between LLM-driven explainability and accurate CTR prediction. By deepening the understanding of user preferences and the recommendation mechanism, ExpCTR shows the potential to significantly enhance both interpretability and recommendation effectiveness. Our key contributions can be summarized as follows:

  • We introduce ExpCTR, an innovative framework that enhances the reasoning capabilities of LLMs to generate precise explanations that are closely aligned with CTR models. This approach simultaneously improves CTR prediction performance and RS interpretability. To the best of our knowledge, this represents the first attempt to leverage LLMs for this dual purpose without dependence on extensive data resources.

  • We develop a reinforcement learning based approach to efficiently fine-tune LLMs using LoRA. Our approach integrates two meticulously designed reward mechanisms within a tailored three-stage training paradigm.

  • We conduct a comparative analysis of ExpCTR against several state-of-the-art CTR prediction methods and evaluate the quality of the generated explanations, demonstrating the effectiveness of our method.

2. Related Work

Explainable recommendation (ER) extends traditional recommendation systems by addressing the "why" behind suggested items. ER provides not only item recommendations but also justifications clarifying the rationale for those suggestions (Zhang et al., 2020b). Current methods can be broadly classified into two categories: model-intrinsic and post-hoc. Model-intrinsic methods aim for inherent explainability by leveraging interpretable algorithms (Zhang et al., 2020a). Conversely, post-hoc approaches use black-box models for recommendation, followed by a separate explanation model that deciphers the reasoning behind the recommendations. The rise of deep neural networks has propelled post-hoc methods to the forefront, transforming explainable recommendation into a natural language generation task. Early works rely on pre-defined templates or association rules (Wang et al., 2018; Gao et al., 2019). Later advancements adopt Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) architectures for generating textual explanations (Li et al., 2017; Zhang et al., 2023). With the advent of the Transformer architecture, researchers have explored its potential for explanation generation (Li et al., 2021b). (Yang et al., 2024) incorporates reinforcement learning techniques to address potential issues such as hallucination. Despite these advancements, these approaches rely on generators trained independently with carefully curated explanation datasets. Given the scarcity of user-item-explanation triplets in real-world RS, substantial efforts have been dedicated to constructing high-quality explanation datasets. Techniques such as word overlap analysis (Li et al., 2021b), LSH-based near-duplicate detection (Li et al., 2021a), and a combination of manual and automatic reformulation of dialogue datasets (Guo et al., 2023) have been employed to this end.

Recently, the burgeoning field of LLMs has spurred research on LLM-based explainable recommendation, which still predominantly employs post-hoc approaches. For instance, (Gao et al., 2023) generates explanations in a zero-shot manner within a conversational scenario, and (Liu et al., 2023a) probes the innate reasoning capabilities of LLMs for recommendation tasks using in-context learning techniques. However, these approaches rely heavily on the LLM's intrinsic reasoning capabilities, with the recommender system remaining unaware of the generated explanations, let alone assessing their quality. This raises concerns about their effectiveness and about whether the produced justifications accurately reflect the true reasoning behind recommendations. This paper aims to address these limitations by proposing a novel approach that ensures coherent and reliable explanations directly integrated within the recommendation process.

3. Preliminary

3.1. Problem Definition

Let $\mathcal{U}=\{u_{1},u_{2},\ldots,u_{n}\}$ denote a set of $n$ users and $\mathcal{I}=\{i_{1},i_{2},\ldots,i_{m}\}$ a set of $m$ items. The user-item interaction data $\mathcal{D}$ is represented by a binary interaction matrix $\mathcal{R}\in\{0,1\}^{n\times m}$, where $\mathcal{R}_{u,i}$ indicates whether user $u$ has interacted with item $i$. A value of 1 signifies explicit feedback (e.g., watching videos, clicking) and 0 otherwise. Each interaction is associated with a textual review $e_{u,i}$. The objective of explainable recommendation is to jointly predict future user interactions and generate explanations for these predictions. We formulate this as a probabilistic model:

(1)  $P(\mathcal{Z},\hat{y}\mid\mathcal{D})$

where $\mathcal{Z}$ represents the set of explanations for all user-item pairs and $\hat{y}$ denotes the predicted interaction scores.

3.2. Theoretical Basis of ExpCTR

To generate post-hoc explanations, we decompose the joint probability as follows:

(2)  $P(\mathcal{Z},\hat{y}\mid\mathcal{D})=\underbrace{P(\hat{y}\mid\mathcal{D})}_{\text{CTR Model}}\cdot\underbrace{P(\mathcal{Z}\mid\hat{y},\mathcal{D})}_{\text{Generator}}.$

We first train a CTR model $f:\mathcal{U}\times\mathcal{I}\rightarrow\mathbb{R}$. This model learns latent representations $\mathbf{h}_{u,i}$ from user-item interactions and side information (e.g., user demographics, item features). The optimization process is formulated as:

(3)  $\min\sum_{(u,i)\in\mathcal{D}}\mathcal{L}_{CTR}(\hat{y},y),$

where y𝑦yitalic_y denotes the ground truth interactions for user-item pairs.We define a generator g:𝒰××𝒱:𝑔𝒰𝒱g:\mathcal{U}\times\mathcal{I}\times\mathbb{R}\rightarrow\mathcal{V}italic_g : caligraphic_U × caligraphic_I × blackboard_R → caligraphic_V that explains why user u𝑢uitalic_u might interact positively or negatively with item i𝑖iitalic_i. This model generates explanations conditioned on the predicted result y^^𝑦\hat{y}over^ start_ARG italic_y end_ARG. The generator is optimized as:

(4)  $\min\sum_{(u,i)\in\mathcal{D}}\sum_{k=1}^{|\overline{e}_{u,i}|}-\log p(t_{k}\mid t_{<k},\hat{y}),$

where $\overline{e}_{u,i}$ denotes the processed customer reviews used as explanation samples (Chen et al., 2019; Li et al., 2021b, a). Post-hoc methodologies exhibit a critical dependency on curated training datasets, which fundamentally shapes the conditional distribution $P(\mathcal{Z}\mid\hat{y},\mathcal{D})$. Another limitation lies in the fact that the generated explanations exert no influence on the CTR model, so there is no guarantee that the explanations faithfully reflect its underlying mechanisms. To address this, we integrate CTR prediction and explanation generation within a unified framework, leveraging LLMs, which can be mathematically expressed as:

(5)  $P(\mathcal{Z},\hat{y}\mid\mathcal{D})=\underbrace{P(\mathcal{Z}\mid\mathcal{D})}_{\text{LLM}}\cdot\underbrace{P(\hat{y}\mid\mathcal{Z},\mathcal{D})}_{\text{CTR Model}}.$

We employ an LLM to generate explanations, circumventing the need to construct a high-quality explanation dataset, a laborious and costly task. The CTR model, $P(\hat{y}\mid\mathcal{Z},\mathcal{D})$, depends on the generated explanations $\mathcal{Z}$, thus establishing a direct link between the explanations and their impact on CTR predictions. This approach represents a significant departure from traditional post-hoc methods, as the generated explanations are not simply after-the-fact rationalizations but integral components of the recommendation decision-making process.

Concretely, we adapt the CTR model to incorporate the generated explanations as features, denoted by $\overline{f}:\mathcal{U}\times\mathcal{I}\times\mathcal{V}\rightarrow\mathbb{R}$. The prediction is then computed as follows:

(6)  $\hat{y}=\overline{f}(\mathcal{R},\mathcal{Z}\mid\Theta_{\text{CTR}}).$
[Figure 2: overall architecture of ExpCTR]

4. Methodology

Figure 2 depicts the overall architecture of ExpCTR. It consists of three primary components: (1) Explanation Generation leverages an LLM to produce textual explanations for recommendations. (2) Reward Design utilizes CTR prediction processes to provide quality assessments of the generated explanations, which serve as reward signals. (3) Training Paradigm introduces LoRA lightweight fine-tuning along with an iterative training process.

4.1. Explanation Generation

Traditional recommendation systems rely on implicit representations of users and items and suffer from a lack of interpretability (Ferrari Dacrema et al., 2019). However, recent advancements in LLMs have demonstrated extensive world knowledge and advanced reasoning capabilities (Liu et al., 2023b; Peng et al., 2023; Wu et al., 2023). These capabilities offer a promising avenue for human-interpretable explanation generation. To harness this potential, we design a prompt template that guides the LLM to generate effective explanations. The prompt leverages the user's historical interaction data and frames the LLM as a helpful recommendation assistant:

The prompt template, originally developed for a book recommendation task (Yu et al., 2024b, a), can be easily adapted to different recommendation scenarios with minor adjustments. Items (e.g., <item_1>, …) are represented by their titles, while users are characterized by the titles of items they have interacted with. To further refine user profiles, we categorize these historical items into liked and disliked categories, using interaction signals such as ratings as indicators. A threshold-based function is employed to classify each item in a user's interaction sequence: items rated above the threshold are considered "liked", while those below are deemed "disliked". The threshold is a hyperparameter adjusted based on dataset characteristics.

This structured prompt empowers the LLM to effectively utilize its knowledge base, inferring nuanced user preferences from the liked and disliked items and generating rationales for why a user might like or dislike a particular target item. By applying this methodology to all user-item interaction pairs in $\mathcal{D}$, we obtain a set of explanations $\mathcal{Z}$, where each explanation $\mathcal{Z}_{u,i}\in\mathcal{Z}$ corresponds to a specific user-item pair.
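To make the construction concrete, the sketch below assembles such a prompt. The exact template wording is not reproduced in this text, so the phrasing, field layout, and the default threshold value are illustrative assumptions:

```python
def build_explanation_prompt(history, target_title, threshold=4):
    """Split the user's history into liked/disliked items by rating
    threshold and assemble a chain-of-thought explanation prompt.

    history: list of (item_title, rating) pairs in chronological order.
    Note: the wording below is illustrative, not the paper's template.
    """
    liked = [title for title, rating in history if rating > threshold]
    disliked = [title for title, rating in history if rating <= threshold]
    return (
        "You are a helpful recommendation assistant.\n"
        f"Books the user liked: {', '.join(liked) or 'none'}.\n"
        f"Books the user disliked: {', '.join(disliked) or 'none'}.\n"
        f"Target book: {target_title}.\n"
        "Think step by step about the user's preferences, then explain "
        "why the user might like or dislike the target book."
    )
```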

4.2. Reward Design

We optimize the LLM through quality assessment of the generated explanations. Inspired by InstructGPT (Ouyang et al., 2022), we leverage a reinforcement learning paradigm with a carefully designed reward function that incentivizes the LLM to generate explanations that are informative for CTR prediction and that accurately represent the underlying user motivations behind interactions.

4.2.1. Proximal Policy Optimization

Following (Ouyang et al., 2022), we adopt the proximal policy optimization (PPO) (Schulman et al., 2017) algorithm for the reinforcement learning process. Given a prompt and a response (explanation), a reward is computed by the reward function, concluding the episode. The objective function for PPO training is formulated as:

(7)  $\text{objective}_{\phi}=E_{(x,\mathcal{Z})\sim D_{\pi_{\phi}^{\text{RL}}}}\left[R(\mathcal{Z})-\beta\log\frac{\pi_{\phi}^{\text{RL}}(\mathcal{Z}\mid x)}{\pi^{\text{init}}(\mathcal{Z}\mid x)}\right],$

where x𝑥xitalic_x denotes the prompt template for explanation generation, as detailed in Section 4.1. πinitsuperscript𝜋init\pi^{\text{init}}italic_π start_POSTSUPERSCRIPT init end_POSTSUPERSCRIPT is the initial LLM and πϕRLsuperscriptsubscript𝜋italic-ϕRL\pi_{\phi}^{\text{RL}}italic_π start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT RL end_POSTSUPERSCRIPT represents the fine-tuned explanation generation language model to be optimized.β𝛽\betaitalic_β is the KL penalty and R(𝒵)𝑅𝒵R(\mathcal{Z})italic_R ( caligraphic_Z ) is our reward function.

The core concept behind our reward function design is to leverage real-world CTR prediction feedback to enhance explanation quality. We aim to incentivize explanations that accurately reflect user intent and preferences. Crucially, these explanations must also align with the underlying mechanics and predicted outcomes of the CTR models employed by the recommender system. To accomplish this, we decompose the reward function into two key components: Explanation and LLM-CTR Alignment, and Explanation and ID-CTR Alignment. These components will be elaborated in the following sections.

4.2.2. Explanation and LLM-CTR Alignment (LC Alignment)

This component evaluates how effectively the LLM's generated explanation captures the intended user behavior. A high LC alignment reward signifies that the explanation successfully conveys the underlying factors influencing the user's interaction. We operationalize the LC alignment reward by leveraging recent advances in CTR prediction with LLMs: we frame the task as a binary classification problem, in which the LLM predicts whether a user will like a given item (e.g., a book) based on the generated rationale (the explanation from Section 4.1).

Specifically, we design a prompt template to guide the LLM towards predicting CTR. This template provides context for the LLM, including the user's thoughts about the item and a binary response option ("Yes" or "No") indicating their decision:

In this template, <reason> is replaced with the explanation $\mathcal{Z}_{u,i}$ from Section 4.1. Formally, we define the predicted CTR score for a user-item pair $(u,i)$ as:

(8)  $s^{u}_{u,i}=\frac{\exp(p(t_{0}=\mathcal{V}_{pos})/T)}{\exp(p(t_{0}=\mathcal{V}_{pos})/T)+\exp(p(t_{0}=\mathcal{V}_{neg})/T)},$

where $p(t_{0}=\mathcal{V}_{pos})$ is the probability that the first token $t_{0}$ generated by the LLM equals $\mathcal{V}_{pos}$, $T$ is the softmax temperature, and $\mathcal{V}_{pos}=\{\text{"Yes"}\}$, $\mathcal{V}_{neg}=\{\text{"No"}\}$.
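The following sketch computes this score, assuming a Hugging Face-style causal LM interface; whether "Yes"/"No" encode as single tokens depends on the tokenizer, so the token lookup is an assumption:

```python
import torch

@torch.no_grad()
def lc_ctr_score(model, tokenizer, prompt, temperature=1.0):
    """Eq. (8): softmax over the 'Yes'/'No' logits of the first generated
    token. Assumes both answers encode as single tokens; LLaMA-style
    tokenizers may need a leading-space variant such as ' Yes'."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    logits = model(**inputs).logits[0, -1]           # next-token logits
    yes_id = tokenizer.convert_tokens_to_ids("Yes")
    no_id = tokenizer.convert_tokens_to_ids("No")
    pair = torch.stack([logits[yes_id], logits[no_id]]) / temperature
    return torch.softmax(pair, dim=0)[0].item()      # s^u_{u,i}
```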

A closer alignment between the CTR prediction and the ground-truth label indicates a more precise explanation, demonstrating the ability to capture the actual factors influencing the user's decision. We formulate the LC alignment reward as:

(9)  $R_{LC}(\mathcal{Z}_{u,i})=1-|y_{u,i}-s^{u}_{u,i}|.$

However, directly using this reward function might lead to unstable gradients due to potential variations in reward scales across different batches (Zheng et al., 2023). We introduce a normalization and clipping procedure to ensure that reward values are appropriately scaled and bounded:

(10)  $R_{LC}^{norm}(\mathcal{Z}_{u,i})=\text{clip}\left(\frac{R_{LC}(\mathcal{Z}_{u,i})-\text{mean}(R_{LC}(\mathcal{Z}_{u,i}))}{\text{std}(R_{LC}(\mathcal{Z}_{u,i}))},\delta\right),$

where $\text{mean}(R_{LC}(\mathcal{Z}_{u,i}))$ and $\text{std}(R_{LC}(\mathcal{Z}_{u,i}))$ denote the mean and standard deviation of the rewards across a batch. The clip function constrains the normalized reward within a predefined bound $\delta$.
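A minimal sketch of Eq. (10); $\delta=1.0$ follows Section 5.1.4, and the epsilon guard against a zero standard deviation is an implementation assumption:

```python
import torch

def normalize_and_clip(rewards, delta=1.0, eps=1e-8):
    """Eq. (10): whiten rewards across the batch, then clip to
    [-delta, delta]."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    r = (r - r.mean()) / (r.std() + eps)
    return r.clamp(-delta, delta)
```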

4.2.3. Explanation and ID-CTR Alignment (IC Alignment)

The congruence between generated explanations and the CTR model is quantitatively assessed by evaluating their contribution to CTR predictions. A positive reward value potentially signifies that the explanation provides substantial insights into the underlying patterns driving CTR. To rigorously evaluate this alignment, we integrate the generated explanations directly into the existing CTR prediction architecture. This integration serves a dual purpose: evaluating explanatory quality and potentially enhancing predictive accuracy by leveraging latent information within the explanations.

Our approach first obtains a dense textual representation for each generated explanation $\mathcal{Z}_{u,i}$ by employing a pre-trained language model (PLM), $f_{encoder}:\mathcal{V}\rightarrow\mathbb{R}^{d}$, to map the explanation text into a unified semantic space. This enables the capture of underlying meaning and relationships within the explanation. $f_{encoder}$ can be any frozen pre-trained language model, such as BERT (Devlin et al., 2018) or BGE (Xiao et al., 2023), and we derive the dense representation for the explanation as follows:

(11)  $\mathbf{z}_{u,i}=\text{MeanPooling}(f_{encoder}(\mathcal{Z}_{u,i})),$

where $\mathbf{z}_{u,i}$ denotes the mean-pooled hidden representations from the last layer of the PLM.
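A sketch of Eq. (11) under a Hugging Face interface; the checkpoint name is an assumption (the paper reports using BGE-small), and the attention-mask-weighted mean is a standard pooling convention rather than a detail stated in the paper:

```python
import torch
from transformers import AutoModel, AutoTokenizer

@torch.no_grad()
def encode_explanation(text, model_name="BAAI/bge-small-en-v1.5"):
    """Eq. (11): mean-pool the frozen encoder's last-layer hidden states
    over the token axis to obtain z_{u,i}."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    encoder = AutoModel.from_pretrained(model_name).eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    hidden = encoder(**inputs).last_hidden_state     # (1, seq_len, d)
    mask = inputs["attention_mask"].unsqueeze(-1)    # ignore padding
    return (hidden * mask).sum(1).squeeze(0) / mask.sum()
```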

Subsequently, we integrate these textual representations with an original ID-based CTR model architecture. This integration facilitates the learning of a joint representation that combines user and item information with the insights provided by the explanation. We propose a simple yet effective concatenation operation to achieve this integration. Specifically, the textual representation $\mathbf{z}_{u,i}$ is concatenated with the hidden representation $\mathbf{h}_{u,i}$ (defined in Section 3.2) and fed into the existing CTR model to predict the CTR score as follows:

(12)  $s^{r}_{u,i}=\overline{f}(\text{Concat}(\mathbf{h}_{u,i},\mathbf{z}_{u,i})),$

where $\overline{f}$ (as in Equation 6) can be any original ID-based model architecture, such as DeepFM (Lian et al., 2018). To evaluate the impact of the semantic representations of the LLM's explanations, we compare the performance of the CTR model with and without these explanations and quantify the difference in CTR predictions. A notable performance improvement when explanations are incorporated indicates that the introduced semantic features contribute positively. This implies a better-aligned explanation, justifying a higher reward. This evaluation is formalized as follows:

(13)  $R_{IC}(\mathcal{Z}_{u,i})=1-|y_{u,i}-s^{r}_{u,i}|+|s^{r}_{u,i}-\tilde{s}^{r}_{u,i}|,$

where $\tilde{s}^{r}_{u,i}$ indicates the CTR prediction score obtained without using explanations as input features, by setting the representation $\mathbf{z}_{u,i}$ to a zero vector. The IC alignment reward is normalized and clipped, as in Equation 10, resulting in $R^{norm}_{IC}(\mathcal{Z}_{u,i})$.
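The sketch below pairs Eq. (12) and Eq. (13). The small MLP head is an illustrative stand-in for the prediction layer of a backbone such as DeepFM, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ExplanationAugmentedHead(nn.Module):
    """Eq. (12): score the concatenation of the ID-based hidden state
    h_{u,i} and the explanation embedding z_{u,i}."""

    def __init__(self, id_dim, text_dim, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(id_dim + text_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, h_ui, z_ui):
        return torch.sigmoid(self.mlp(torch.cat([h_ui, z_ui], -1))).squeeze(-1)

def ic_reward(head, h_ui, z_ui, y):
    """Eq. (13): prediction accuracy with the explanation, plus the shift
    the explanation induces relative to zeroed-out text features."""
    s_r = head(h_ui, z_ui)                     # s^r_{u,i}
    s_r0 = head(h_ui, torch.zeros_like(z_ui))  # tilde{s}^r_{u,i}
    return 1.0 - (y - s_r).abs() + (s_r - s_r0).abs()
```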

By synergizing the effects of these two reward components during the LLM training phase, we aim to ensure that the generated explanations not only accurately capture user rationales behind their behavior but also contribute meaningfully to the performance of CTR models. This approach promotes explanations that are both faithful and informative, ultimately leading to a more robust and interpretable recommendation system.

4.3. Training Paradigm

4.3.1. Lightweight Tuning

To mitigate the computational burden of training three independent LLMs, namely the initial LLM $\pi^{init}$, the explanation generator $\pi_{\phi}^{\text{RL}}$, and the LC alignment reward model, we adopt a lightweight tuning approach. Recent findings (Houlsby et al., 2019; Li and Liang, 2021; Hu et al., 2021) suggest that LLMs can be effectively compressed without significant performance degradation, owing to the inherently lower-dimensional nature of the information they encode. Leveraging this insight, we employ Low-Rank Adapters (LoRA) (Hu et al., 2021), which introduce trainable low-rank matrices into each transformer layer, allowing for efficient parameterization while preserving model performance. Specifically, we employ a base LLM as both the initial model and the frozen LC alignment reward model, and instantiate the explanation generator as the same base LLM equipped with LoRA. Crucially, this strategy drastically reduces the number of trainable parameters. By consolidating computation into a single LLM with a minimal set of LoRA parameters, we achieve substantial computational efficiency without compromising model quality.
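A minimal sketch of this setup using the peft library; the checkpoint path and the rank/alpha/target-module choices are illustrative assumptions, not values reported in the paper:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# One frozen base LLM serves as both pi^init and the LC reward scorer;
# LoRA adapters on the same weights form the trainable generator pi_phi^RL.
base_llm = AutoModelForCausalLM.from_pretrained("path/to/base-llm")

lora_config = LoraConfig(
    r=8,                                   # illustrative rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections
    task_type="CAUSAL_LM",
)
generator = get_peft_model(base_llm, lora_config)
generator.print_trainable_parameters()    # only the low-rank adapters train
```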

4.3.2. Iterative Training

This section outlines the iterative training methodology employed to optimize the explanation generation model, πϕRLsuperscriptsubscript𝜋italic-ϕRL\pi_{\phi}^{\text{RL}}italic_π start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT RL end_POSTSUPERSCRIPT, considering both LC alignment and IC alignment rewards. Our approach involves a three-stage iterative training process, alternating between component-specific optimization phases.

Stage 1. LC Alignment. We commence by utilizing a frozen language model $\pi^{init}$ to compute LC rewards. The explanation generation model $\pi_{\phi}^{\text{RL}}$ is then optimized using the LC alignment reward:

(14)  $R(\mathcal{Z})=R_{LC}^{norm}(\mathcal{Z}_{u,i}).$

This phase establishes a foundational understanding of "correct" explanations, aligning the model with user preferences and intentions and producing factually sound explanations. This stage persists for a predetermined number of iterations, during which we continuously accumulate fresh explanations for each user-item pair.

Stage 2. CTR Model Training with Textual Features. Following Stage 1, we accumulate a corpus of generated explanations. These explanations are integrated as textual features with the original ID dataset for training the CTR model $\overline{f}$:

(15)  $\min\sum_{(u,i,\mathcal{Z}_{u,i})\in\{\mathcal{D},\mathcal{Z}\}}\mathcal{L}_{CTR}(s^{r}_{u,i},y_{u,i}).$

Stage 3. IC Alignment. In the final stage, we further refine the explanation generation model by incorporating the IC alignment reward:

(16)  $R(\mathcal{Z})=R_{IC}^{norm}(\mathcal{Z}_{u,i}).$

Stages 2 and 3 are then repeated for a predefined number of iterations, allowing for continuous model refinement and performance improvement. Upon completion of the training process, we obtain a robustly trained explanation generation model $\pi_{\phi}^{\text{RL}}$ and a CTR model $\overline{f}$ that effectively leverages textual features for prediction. This unified training approach ensures that generated explanations are informative and resonate with both users and the CTR model, avoiding the pitfall of producing generic or uninformative content.
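The control flow of the three-stage schedule can be sketched as below, expressed over injected callables so the structure is explicit; the iteration counts are placeholders, and the callable signatures are assumptions for illustration:

```python
def train_expctr(gen_batch, ppo_step, reward_lc, reward_ic, train_ctr,
                 lc_iters=4, outer_iters=3, ic_iters=2):
    """Three-stage iterative schedule of Section 4.3.2.

    gen_batch() -> (pairs, explanations); reward_* score explanations;
    ppo_step updates the LoRA generator; train_ctr fits f-bar on the
    accumulated explanation corpus (Stage 2).
    """
    corpus = {}
    for _ in range(lc_iters):                       # Stage 1: LC alignment
        pairs, zs = gen_batch()
        ppo_step(pairs, zs, reward_lc(pairs, zs))
        corpus.update(zip(pairs, zs))               # keep fresh explanations
    for _ in range(outer_iters):                    # Stages 2-3, repeated
        train_ctr(corpus)                           # Stage 2: CTR with text
        for _ in range(ic_iters):                   # Stage 3: IC alignment
            pairs, zs = gen_batch()
            ppo_step(pairs, zs, reward_ic(pairs, zs))
            corpus.update(zip(pairs, zs))
    return corpus
```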

Table 1. CTR prediction performance on the three datasets. Each cell reports AUC / LogLoss / MAE / RMSE.

| Models | BookCrossing (AUC / LogLoss / MAE / RMSE) | ML-20M (AUC / LogLoss / MAE / RMSE) | Amazon Books (AUC / LogLoss / MAE / RMSE) |
|---|---|---|---|
| FM | 0.5176 / 0.6950 / 0.4991 / 0.5009 | 0.6231 / 0.6625 / 0.4702 / 0.4849 | 0.5667 / 0.6876 / 0.4950 / 0.4972 |
| DeepFM | 0.5222 / 0.6924 / 0.4987 / 0.4996 | 0.6212 / 0.6708 / 0.4762 / 0.4891 | 0.5639 / 0.6873 / 0.4956 / 0.4971 |
| AutoInt | 0.5176 / 0.6922 / 0.4990 / 0.4995 | 0.6324 / 0.6636 / 0.4618 / 0.4851 | 0.5678 / 0.6868 / 0.4957 / 0.4968 |
| PNN | 0.5257 / 0.6926 / 0.4994 / 0.4997 | 0.6363 / 0.6786 / 0.4575 / 0.4917 | 0.5733 / 0.6860 / 0.4938 / 0.4964 |
| xDeepFM | 0.5124 / 0.6980 / 0.4999 / 0.5024 | 0.6440 / 0.6628 / 0.4645 / 0.4851 | 0.5640 / 0.6912 / 0.4905 / 0.4987 |
| FiGNN | 0.5211 / 0.6922 / 0.4989 / 0.4997 | 0.6416 / 0.6762 / 0.4751 / 0.4917 | 0.5673 / 0.6893 / 0.4919 / 0.4980 |
| DCN | 0.5198 / 0.6954 / 0.4987 / 0.5011 | 0.6250 / 0.6891 / 0.4465 / 0.4942 | 0.5487 / 0.7120 / 0.4906 / 0.5082 |
| DCNV2 | 0.5195 / 0.7188 / 0.4943 / 0.5113 | 0.6134 / 0.6800 / 0.4806 / 0.4935 | 0.5433 / 0.7230 / 0.4907 / 0.5132 |
| DIN | 0.5132 / 0.7660 / 0.4931 / 0.5308 | 0.6033 / 0.7540 / 0.4449 / 0.5134 | 0.5120 / 0.8968 / 0.4945 / 0.5600 |
| DIEN | 0.5231 / 0.7353 / 0.5030 / 0.5200 | 0.6074 / 0.6798 / 0.4654 / 0.4925 | 0.5096 / 0.9111 / 0.4962 / 0.5572 |
| CASER | 0.5208 / 0.6931 / 0.4997 / 0.5000 | 0.6407 / 0.6836 / 0.4949 / 0.4952 | 0.5205 / 0.6919 / 0.4991 / 0.4994 |
| GRU4Rec | 0.5356 / 1.3650 / 0.4911 / 0.5664 | 0.6403 / 0.6902 / 0.4985 / 0.4985 | 0.5283 / 0.6930 / 0.4999 / 0.4999 |
| SASRec | 0.5322 / 1.1634 / 0.4864 / 0.5771 | 0.6197 / 0.6695 / 0.4817 / 0.4882 | 0.5181 / 0.7145 / 0.4932 / 0.5072 |
| BERT4Rec | 0.5136 / 1.0717 / 0.5017 / 0.6075 | 0.5866 / 0.6789 / 0.4885 / 0.4929 | 0.5298 / 0.7547 / 0.4923 / 0.5130 |
| TALLRec | 0.5389 / 0.6929 / 0.4969 / 0.5005 | 0.6660 / 0.6541 / 0.4726 / 0.4804 | 0.5744 / 0.6868 / 0.4955 / 0.4968 |
| ICL | 0.5663 / 0.7328 / 0.4829 / 0.5174 | 0.6320 / 0.6754 / 0.4556 / 0.4908 | 0.5930 / 0.7246 / 0.4715 / 0.5133 |
| ExpCTR-LLM | 0.6042 / 0.6943 / 0.4734 / 0.4999 | 0.6707 / 0.6428 / 0.4523 / 0.4749 | 0.6290 / 0.6831 / 0.4720 / 0.4946 |
| ExpCTR-Aug | 0.6173 / 0.6715 / 0.4800 / 0.4891 | 0.6951 / 0.6389 / 0.4210 / 0.4710 | 0.6641 / 0.6493 / 0.4557 / 0.4783 |

5. Experiment

In this section, we detail the experimental setup to evaluate the performance of ExpCTR. We aim to address the following research questions through a series of rigorous experiments and analyses:

  • RQ1: How does ExpCTR compare to existing state-of-the-art approaches in terms of generating explanations for recommendation decisions and improving CTR prediction?

  • RQ2: How effective is the integration of the PPO algorithm in ExpCTR?

  • RQ3: How does the quality of the explanations produced by our framework measure up?

5.1. Experimental Setting

5.1.1. Datasets

To comprehensively evaluate the effectiveness and generalizability of our proposed framework, we leverage three publicly available, large-scale datasets: BookCrossing (https://www.kaggle.com/datasets/somnambwl/bookcrossing-dataset), MovieLens-20M (https://grouplens.org/datasets/movielens/20m/), and Amazon Books (https://jmcauley.ucsd.edu/data/amazon/). Following (Bao et al., 2023), we employ a stratified random sampling approach for each user within each dataset. Specifically, we randomly select one item the user interacted with as the target item for prediction. The remaining interacted items, up to a maximum of 10 items chronologically preceding the target item, are considered the user's historical interactions. We then partition the constructed data samples into training, validation, and testing sets with a ratio of 8:1:1. For datasets containing rating scores, we binarize the ratings using a threshold: ratings above the threshold are considered positive interactions (items the user liked), while ratings below it are considered negative interactions. Specifically, the threshold is 4 for the ML-20M and Amazon Books datasets and 5 for the BookCrossing dataset (Song et al., 2019; Zhou et al., 2019). Finally, Amazon Books and ML-20M each comprise 16,000/2,000/2,000 samples, while BookCrossing comprises 32,000/4,000/4,000 samples.
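A sketch of the per-user sample construction described above; how ties at the rating threshold are handled is an assumption here, since the text only specifies "above" and "below":

```python
import random

def build_user_sample(interactions, max_hist=10, threshold=4):
    """Pick one interacted item as the target, keep up to the 10
    chronologically preceding items as history, and binarize the rating.

    interactions: chronologically sorted list of (item_title, rating).
    threshold: 4 for ML-20M / Amazon Books, 5 for BookCrossing.
    """
    t = random.randrange(len(interactions))          # target index
    target_title, rating = interactions[t]
    history = interactions[max(0, t - max_hist):t]
    label = int(rating > threshold)                  # positive if above threshold
    return history, target_title, label
```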

5.1.2. Compared Methods

For CTR evaluations of ExpCTR, we leverage two distinct scoring mechanisms: LLM scores derived from the LC alignment module, designated "ExpCTR-LLM", and CTR scores that use explanations as textual features from the IC alignment module, referred to as "ExpCTR-Aug". ExpCTR-LLM reflects how effectively the generated explanations capture and articulate user preferences and rationales for future interactions, which results in better outcomes under an LLM scorer. Conversely, a superior ExpCTR-Aug score suggests that the explanation aligns well with the internal workings of ID-based CTR models and provides supplementary information that enhances performance. This dual evaluation approach provides an indirect yet effective method for assessing explanation quality. We compare ExpCTR against diverse established baseline models, encompassing both ID-based and LLM-based recommendation methods:

  • ID-based methods: Factorization Machines (FM) (Rendle, 2010) capture pairwise feature interactions for recommendation tasks. Deep learning models, including DSSM (Huang et al., 2013), DeepFM (Lian et al., 2018), AutoInt (Song et al., 2019), PNN (Qu et al., 2016), Fi-GNN (Li et al., 2019), DCN (Wang et al., 2017), and DCNV2 (Wang et al., 2021), utilize multi-layer perceptrons, self-attention mechanisms, and graph neural networks to effectively capture both low-order and high-order feature interactions, enhancing recommendation accuracy. DIN (Zhou et al., 2018) and DIEN (Zhou et al., 2019) leverage attention mechanisms to extract users' dynamic interests from their historical behavior sequences. Caser (Tang et al., 2016), GRU4Rec (Hidasi et al., 2016), SASRec (Kang and McAuley, 2018), and BERT4Rec (Sun et al., 2019) are sequential recommendation models that employ Convolutional Neural Networks (CNNs), Gated Recurrent Units (GRUs), and Transformer encoders, respectively, for robust user behavior modeling.

  • LLM-based methods: In-Context Learning (ICL) for recommendation (Dai et al., 2023) obtains recommendations by directly posing queries to an LLM. TALLRec (Bao et al., 2023) adapts LLMs to recommendation scenarios through instruction tuning.

5.1.3. Metrics

To assess the effectiveness of ExpCTR, we utilize standard CTR prediction metrics (Lian et al., 2018; Zhou et al., 2018). Specifically, we evaluate performance using the Area Under the ROC Curve (AUC), binary cross-entropy loss (LogLoss), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE).
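These four metrics can be computed directly with scikit-learn and NumPy; a minimal sketch:

```python
import numpy as np
from sklearn.metrics import log_loss, mean_absolute_error, roc_auc_score

def ctr_metrics(y_true, y_score):
    """The four metrics reported in Table 1, given binary labels and
    predicted click probabilities."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    return {
        "AUC": roc_auc_score(y_true, y_score),
        "LogLoss": log_loss(y_true, y_score, labels=[0, 1]),
        "MAE": mean_absolute_error(y_true, y_score),
        "RMSE": float(np.sqrt(np.mean((y_true - y_score) ** 2))),
    }
```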

5.1.4. Implementation Details

In our experimental setup, we employ LLaMA-3-7B as the foundational model for both explanation generation and LC alignment reward computation. For explanation encoding, we employ BGE-small (Xiao et al., 2023). The IC alignment reward is built upon DeepFM and implemented through the open-source project RecBole (Zhao et al., 2021). Our optimization uses a learning rate of $1\times10^{-5}$ with a KL penalty of 0.05. The reward clip threshold $\delta$ is set to 1.0. The iterative training paradigm consists of two epochs per iteration. For TALLRec, we leverage the entire training dataset and apply a learning rate of $1\times10^{-4}$. All experiments are conducted on a single machine equipped with NVIDIA A800 GPUs.

5.2. Performance Comparison (RQ1)

Table 1 presents a comparative analysis of our proposed method with existing ID-based CTR methods and LLM-based methods. The results yield several noteworthy observations:

  • Baseline models such as ICL and TALLRec demonstrate strong performance across all datasets, particularly when compared to ID-based methods. This suggests that LLMs possess a robust foundational capability for reasoning and comprehension. Nevertheless, ExpCTR-LLM consistently surpasses these two LLM-based CTR models on all metrics and datasets. This empirical evidence indicates that our generated explanations accurately reflect and describe user behavior patterns, leading to significant performance improvements over ICL, which uses the same frozen LLM scorer, and TALLRec, which is fine-tuned directly on the CTR prediction task. These findings highlight ExpCTR's capability to leverage the intrinsic reasoning capabilities of LLMs effectively. Additionally, the consistent performance gains underscore the efficacy of our proposed framework, with the integration of reinforcement learning and the LC alignment reward further extending the potential of LLMs in recommendation scenarios.

  • ExpCTR-Aug emerges as a substantial advancement over ExpCTR-LLM, demonstrating superior performance across all evaluated datasets. This result highlights the pivotal role of the IC alignment reward in augmenting model efficacy. The explanations generated by ExpCTR-Aug offer profound insights to ID-based CTR models, leading to considerable performance improvements over the DeepFM baseline, with relative gains of 18.2%, 11.9%, and 17.8% in AUC on the respective datasets. These results underscore the dual advantage of ExpCTR: it not only enhances the interpretability of the recommendation system but also delivers substantial improvements in recommendation accuracy.

5.3. In-depth Analysis of PPO (RQ2)

5.3.1. PPO Reward Analysis

To investigate the efficacy and learning dynamics of the training paradigm in ExpCTR, we conduct a comprehensive analysis of the reward trajectories during the iterative training process. Specifically, we examine the evolution of the LC alignment and IC alignment rewards on the ML-20M and BookCrossing datasets. The results of this analysis are presented in Figure 3.

Our analysis reveals a consistent upward trend in LC alignment rewards across both datasets, which stabilizes during the final stages of training and coincides with the performance enhancement of ExpCTR-LLM. Notably, the ML-20M dataset exhibits a more pronounced increase, ultimately reaching a higher plateau. This observation aligns with the superior performance obtained on the ML-20M dataset (a 9.1% improvement in AUC over ICL). These observations suggest that our LC alignment reward mechanism effectively steers LLMs towards generating explanations that are increasingly congruent with user behavior patterns and exhibit a strong correlation with subsequent user interactions. In contrast, the IC alignment stage is characterized by a more fluctuating curve across both datasets, with the alignment stabilizing more rapidly compared to LC alignment. This corresponds to the relatively modest improvement of ExpCTR-Aug over ExpCTR-LLM. The BookCrossing dataset experiences a slight decline followed by steady growth, reflecting the refinement process of IC alignment for LLM-based explanation generation. This empirical evidence underscores the effectiveness of our training paradigm in fostering a more user-centric and contextually relevant recommendation system.

5.3.2. Hyperparameter Sensitivity Analysis

We assess the sensitivity of the hyperparameters in PPO training, specifically the KL penalty $\beta$ and the reward normalization bound $\delta$, both of which are crucial for effective PPO training (Schulman et al., 2017). Figure 4 shows the performance variations across different hyperparameter settings, with $\beta$ ranging over {0.01, 0.05, 0.1, 0.5} and $\delta$ over {0.5, 1.0, 2.0, 5.0}, evaluated on the BookCrossing and Amazon Books datasets. The KL penalty $\beta$ has a notable effect on ExpCTR's performance across both datasets: extreme values, either too high or too low, detract from the model's capabilities, with a setting of 0.05 typically yielding the most competitive results. In contrast, the reward normalization bound $\delta$ shows significant performance variability on the BookCrossing dataset while remaining stable on Amazon Books; we choose $\delta=1.0$ for both datasets.

Table 2. Explanations generated by ICL and by our method (Ours), shown with the user's actual review.

Case 1
Review: The right book, but I would have had to return it for the code that comes with it. My class had already started. All materials should have been included.
ICL: Based on the customer's interest in books about crystals, gemstones, and nursery rhymes, I predict that they will have a neutral or indifferent opinion about "Real Research: Conducting and Evaluating Research in the Social Sciences" as this book's topics and focus on rigorous research methods and scientific inquiry may not align with their preferred themes.
Ours: Based on the customer's preferences for books like "Crystal Lore" and "Legends & My Thrills: The Fascinating History of the World's Most Powerful Gems and Stones", I predict that they will likely enjoy "Real Research: Conducting and Evaluating Research in the Social Sciences" and consider it a strong fit for their preferences because it is a non-fiction book that offers a sense of discovery and exploration, similar to the themes of history and mythology in their preferred books, and involves a sense of mystery and discovery that is also present in the nursery rhyme book, "Jack and Jill", to which they also felt drawn.

Case 2
Review: …I am happy to say that the sharp wit, business savvy, and zeal that I experienced in my conversations with Gary come across in Crush It! … Clearly, anyone new to the world of social media will find this book informative, instructive, and easy to read… I suppose I am a great example of the latter, … regularly speak to business and professional groups, and use these techniques every day and I have a page of notes that I made while reading this book.
ICL: I predict that this customer will likely dislike "Crush It!: Why NOW Is the Time to Cash In on Your Passion" because they have shown a preference for practical, non-fiction books focused on leadership and strategy (e.g. "Surviving to Thriving" and "The Expert's Edge"), and "Crush It!" is a self-help book with a more entrepreneurial and inspirational tone that may not align with their reading interests.
Ours: Based on the customer's preferences, I predict that they will likely enjoy "Crush It!: Why NOW Is the Time to Cash In on Your Passion" and consider it a book that they will like, as it shares a similar tone and style of writing with the books they have liked, such as "Glissando" and "The Art of Woo", and will likely appeal to their interest in topics such as business, self-improvement, and marketing, which is also present in the books they have liked, such as "Predictably Irrational" and "Made to Stick".

5.4. Case Study (RQ3)

To elucidate the efficacy of ExpCTR in generating improved explanations, we present a comparative analysis of explanations produced by ICL and our proposed approach. Table 2 illustrates representative examples, accompanied by actual user reviews to provide real-world context for our analysis.

In the first case, the user's attitude towards the target book is fundamentally positive. The comment "The right book" indicates approval, while "My class had already started" reflects an extraneous factor beyond the scope of the recommender system. This positive sentiment aligns with the high CTR prediction of 0.8843. ICL nevertheless infers a neutral-to-negative attitude ("neutral or indifferent opinion"). ExpCTR, in contrast, successfully captures this alignment with the CTR model, generating an explanation that accurately reflects the user's probable affinity for the book ("similar to the themes of history and mythology in their preferred books"). The second case demonstrates a more nuanced improvement. The ICL approach erroneously infers a negative attitude ("likely dislike"), contradicting the user's actual 5.0 rating. In contrast, ExpCTR, enhanced by LC alignment reward training, correctly identifies the positive interaction potential ("likely enjoy"). Furthermore, the explanation generated by our model corresponds closely to the user's actual thoughts, accurately identifying the book's themes of "business, self-improvement, and marketing".

6. Conclusion

In this paper, we presented ExpCTR to address the limitations of current post-hoc explainable recommendation methods. By integrating LLM-based explanation generation directly into the CTR prediction process, ExpCTR eliminates the need for extensive data preparation and mitigates reliability concerns. Our approach leverages reinforcement learning to align LLM reasoning with both user preferences and the recommender system's internal workings. We believe ExpCTR represents a significant step forward in explainable recommendation and opens new avenues for future research.

References

  • Bao et al. (2023) Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023. TALLRec: An effective and efficient tuning framework to align large language model with recommendation. In Proceedings of the 17th ACM Conference on Recommender Systems. 1007–1014.
  • Chen et al. (2019) Zhongxia Chen, Xiting Wang, Xing Xie, Tong Wu, Guoqing Bu, Yining Wang, and Enhong Chen. 2019. Co-attentive multi-task learning for explainable recommendation. In IJCAI, Vol. 2019. 2137–2143.
  • Dai et al. (2023) Sunhao Dai, Ninglu Shao, Haiyuan Zhao, Weijie Yu, Zihua Si, Chen Xu, Zhongxiang Sun, Xiao Zhang, and Jun Xu. 2023. Uncovering ChatGPT's capabilities in recommender systems. In Proceedings of the 17th ACM Conference on Recommender Systems. 1126–1132.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Ferrari Dacrema et al. (2019) Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. 2019. Are we really making much progress? A worrying analysis of recent neural recommendation approaches. In Proceedings of the 13th ACM Conference on Recommender Systems. 101–109.
  • Gao et al. (2019) Jingyue Gao, Xiting Wang, Yasha Wang, and Xing Xie. 2019. Explainable recommendation through attentive multi-view learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 3622–3629.
  • Gao et al. (2023) Yunfan Gao, Tao Sheng, Youlin Xiang, Yun Xiong, Haofen Wang, and Jiawei Zhang. 2023. Chat-REC: Towards interactive and explainable LLMs-augmented recommender system. arXiv preprint arXiv:2303.14524 (2023).
  • Guo et al. (2023) Shuyu Guo, Shuo Zhang, Weiwei Sun, Pengjie Ren, Zhumin Chen, and Zhaochun Ren. 2023. Towards explainable conversational recommender systems. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2786–2795.
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. 173–182.
  • Hidasi et al. (2016) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based recommendations with recurrent neural networks. In Proceedings of the 4th International Conference on Learning Representations.
  • Houlsby et al. (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning. PMLR, 2790–2799.
  • Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
  • Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. 2333–2338.
  • Jannach et al. (2010) Dietmar Jannach, Markus Zanker, Alexander Felfernig, and Gerhard Friedrich. 2010. Recommender Systems: An Introduction. Cambridge University Press.
  • Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 197–206.
  • Lei et al. (2023) Yuxuan Lei, Jianxun Lian, Jing Yao, Xu Huang, Defu Lian, and Xing Xie. 2023. RecExplainer: Aligning large language models for recommendation model interpretability. arXiv preprint arXiv:2311.10947 (2023).
  • Li et al. (2021a) Lei Li, Yongfeng Zhang, and Li Chen. 2021a. EXTRA: Explanation ranking datasets for explainable recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2463–2469.
  • Li et al. (2021b) Lei Li, Yongfeng Zhang, and Li Chen. 2021b. Personalized Transformer for explainable recommendation. arXiv preprint arXiv:2105.11601 (2021).
  • Li et al. (2017) Piji Li, Zihao Wang, Zhaochun Ren, Lidong Bing, and Wai Lam. 2017. Neural rating regression with abstractive tips generation for recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 345–354.
  • Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021).
  • Li et al. (2019) Zekun Li, Zeyu Cui, Shu Wu, Xiaoyu Zhang, and Liang Wang. 2019. Fi-GNN: Modeling feature interactions via graph neural networks for CTR prediction. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 539–548.
  • Lian et al. (2018) Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1754–1763.
  • Liu et al. (2023b) Hanmeng Liu, Ruoxi Ning, Zhiyang Teng, Jian Liu, Qiji Zhou, and Yue Zhang. 2023b. Evaluating the logical reasoning ability of ChatGPT and GPT-4. arXiv preprint arXiv:2304.03439 (2023).
  • Liu et al. (2023a) Junling Liu, Chao Liu, Peilin Zhou, Renjie Lv, Kang Zhou, and Yan Zhang. 2023a. Is ChatGPT a good recommender? A preliminary study. arXiv preprint arXiv:2304.10149 (2023).
  • Mnih and Salakhutdinov (2007) Andriy Mnih and Russ R. Salakhutdinov. 2007. Probabilistic matrix factorization. Advances in Neural Information Processing Systems 20 (2007).
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
  • Peng et al. (2023) Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. 2023. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277 (2023).
  • Qu et al. (2016) Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. 2016. Product-based neural networks for user response prediction. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 1149–1154.
  • Rendle (2010) Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International Conference on Data Mining. IEEE, 995–1000.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
  • Song et al. (2019) Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019. AutoInt: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 1161–1170.
  • Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from Transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 1441–1450.
  • Tang et al. (2016) Jiaxi Tang, Ke Wang, Liqiang Zhang, Shuai Li, Jiajie Yan, and Zheng Zhang. 2016. Personalized top-N sequential recommendation via convolutional sequence embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 427–436.
  • Tintarev (2007) Nava Tintarev. 2007. Explanations of recommendations. In Proceedings of the 2007 ACM Conference on Recommender Systems. 203–206.
  • Wang et al. (2018) Nan Wang, Hongning Wang, Yiling Jia, and Yue Yin. 2018. Explainable recommendation via multi-task learning in opinionated text data. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 165–174.
  • Wang et al. (2017) Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & cross network for ad click predictions. In Proceedings of the ADKDD'17. 1–7.
  • Wang et al. (2021) Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi. 2021. DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of the Web Conference 2021. 1785–1797.
  • Wu et al. (2023) Sean Wu, Michael Koo, Lesley Blum, Andy Black, Liyo Kao, Fabien Scalzo, and Ira Kurtz. 2023. A comparative study of open-source large language models, GPT-4 and Claude 2: Multiple-choice test taking in nephrology. arXiv preprint arXiv:2308.04709 (2023).
  • Xiao et al. (2023) Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023. C-Pack: Packaged resources to advance general Chinese embedding. arXiv preprint arXiv:2309.07597 (2023).
  • Yang et al. (2024) Mengyuan Yang, Mengying Zhu, Yan Wang, Linxun Chen, Yilei Zhao, Xiuyuan Wang, Bing Han, Xiaolin Zheng, and Jianwei Yin. 2024. Fine-tuning large language model based explainable recommendation with explainable quality reward. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 9250–9259.
  • Yu et al. (2024a) Xiaohan Yu, Li Zhang, Xin Zhao, and Yue Wang. 2024a. Break the ID-language barrier: An adaption framework for sequential recommendation. arXiv preprint arXiv:2411.18262 (2024).
  • Yu et al. (2024b) Xiaohan Yu, Li Zhang, Xin Zhao, Yue Wang, and Zhongrui Ma. 2024b. RA-Rec: An efficient ID representation alignment framework for LLM-based recommendation. arXiv preprint arXiv:2402.04527 (2024).
  • Zhang et al. (2023) Jingsen Zhang, Xu Chen, Jiakai Tang, Weiqi Shao, Quanyu Dai, Zhenhua Dong, and Rui Zhang. 2023. Recommendation with causality enhanced natural language explanations. In Proceedings of the ACM Web Conference 2023. 876–886.
  • Zhang et al. (2020a) Yongfeng Zhang, Xu Chen, et al. 2020a. Explainable recommendation: A survey and new perspectives. Foundations and Trends® in Information Retrieval 14, 1 (2020), 1–101.
  • Zhao et al. (2021) Wayne Xin Zhao, Shanlei Mu, Yupeng Hou, Zihan Lin, Yushuo Chen, Xingyu Pan, Kaiyuan Li, Yujie Lu, Hui Wang, Changxin Tian, Yingqian Min, Zhichao Feng, Xinyan Fan, Xu Chen, Pengfei Wang, Wendi Ji, Yaliang Li, Xiaoling Wang, and Ji-Rong Wen. 2021. RecBole: Towards a unified, comprehensive and efficient framework for recommendation algorithms. In CIKM. ACM, 4653–4664.
  • Zheng et al. (2023) Rui Zheng, Shihan Dou, Songyang Gao, Yuan Hua, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Yuhao Zhou, et al. 2023. Secrets of RLHF in large language models Part I: PPO. arXiv preprint arXiv:2307.04964 (2023).
  • Zhou et al. (2019) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5941–5948.
  • Zhou et al. (2018) Guorui Zhou, Chengru Song, Xiaoqiang Zhu, Xiao Ma, Yanghui Yan, Han Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1059–1068.