In the era of computational propaganda, the pervasive influence of social bots has raised significant concerns about misinformation dissemination and cognitive manipulation. As automated programs designed to mimic human users on social media platforms, social bots can generate content and engage in interactions at scale, often with malicious intent to sway public opinion or disrupt discourse. Accurate and robust detection of these accounts is therefore critical for safeguarding ideological integrity and national security. Traditional detection methods rely predominantly on single-evidence approaches built from cross-sectional data, such as account metadata, text content, or social network structure. However, these methods often overlook the temporal dynamics of user behavior and the potential benefits of multi-evidence fusion, leading to suboptimal accuracy and stability. In this article, I propose a framework that integrates user behavior sequences and user profile metadata via Dempster-Shafer (D-S) evidence theory to enhance social bot detection. By leveraging the complementary strengths of these evidence types, the approach addresses the limitations of existing methods and delivers superior precision and reliability.
The proliferation of bot technology on social media has exacerbated the challenge of distinguishing human users from automated accounts. Social bots can emulate human-like characteristics in account profiles, content generation, and social interactions, making them increasingly difficult to detect with conventional techniques. During major political events or crises, for instance, bots have been deployed to amplify specific narratives or manufacture artificial consensus, as seen in the Brexit referendum and the COVID-19 infodemic. Existing detection systems, such as those based on machine learning or graph neural networks, typically focus on static features extracted from a single data modality. While these approaches have achieved some success, they are vulnerable to evasion by sophisticated bots that can mimic human behavior over short periods. This underscores the need for methods that capture behavioral anomalies over time and combine multiple evidence sources to reduce both false positives and false negatives.

To address these gaps, my work introduces a multi-evidence fusion framework grounded in D-S evidence theory, which is well-suited for handling uncertain and conflicting information from diverse sources. The core innovation lies in extracting and encoding user behavior sequences from historical post data, complementing traditional profile-based features. User behavior sequences, represented as digital DNA, capture the temporal patterns of actions such as original posts, retweets, and replies, along with inter-action time intervals. These sequences are processed using N-gram models to extract features that reveal bot-like behaviors, such as excessive posting frequencies or regular intervals indicative of automation. Simultaneously, user profile metadata, including follower counts, posting rates, and account attributes, are leveraged to form a complementary evidence source. By training separate classifiers on these feature sets and fusing their outputs via D-S evidence theory, the framework achieves a more holistic and reliable detection mechanism. Experimental results on the TwiBot-22 dataset validate the effectiveness of this approach, showing significant improvements in accuracy and F1-score compared to single-evidence methods and state-of-the-art baselines.
Related Work and Theoretical Foundations
Social bot detection has evolved considerably with advances in bot technology and artificial intelligence. Early approaches relied primarily on account metadata, such as follower counts, friend counts, and posting frequency, to identify anomalies. For example, systems like BotOrNot apply random forests to a wide range of features to classify bots, but they often struggle to generalize because these features are static. Similarly, content-based methods analyze text sentiment, similarity, or linguistic patterns to flag suspicious accounts, yet they can be circumvented by bots that generate human-like text with natural language generation techniques. More recently, graph-based methods have gained prominence for their ability to model social relationships and topological structure. Techniques built on graph neural networks (GNNs) excel at capturing homophily and heterophily in social networks, but they require extensive data collection and computational resources, limiting their practicality in real-time scenarios.
A critical limitation of these methods is their reliance on a single evidence source, which cannot fully capture the multifaceted nature of bot behavior. As bot technology advances, bots are becoming adept at mimicking individual features, necessitating the integration of multiple evidence types. Feature-level fusion, which concatenates diverse attributes into a single feature vector, has been explored but often struggles with heterogeneous data modalities. Decision-level fusion, by contrast, aggregates the outputs of multiple classifiers, leveraging their respective strengths. D-S evidence theory is a powerful framework for this purpose because it models uncertainty and conflict between evidence sources through basic probability assignments. In D-S theory, a frame of discernment $\Theta$ enumerates all possible propositions; in this case, $\Theta = \{\text{human}, \text{bot}\}$. Each evidence source defines a basic probability assignment (BPA) over subsets of $\Theta$, and Dempster's rule combines these assignments into a consensus. The theory has been applied successfully in domains such as rumor detection and user credibility assessment, making it well suited to social bot detection, where evidence from behavior sequences and profiles may conflict with or complement each other.
The use of behavioral sequences for bot detection is rooted in the observation that bots often exhibit temporal patterns distinct from those of humans. For instance, studies have shown that bots may post in bursts or at regular intervals, whereas human behavior is more erratic. Digital DNA encoding, inspired by biological sequences, represents user actions as strings of symbols, enabling the application of sequence analysis techniques. By incorporating the time intervals between actions, this encoding captures dynamic behaviors that static features miss. When combined with profile metadata through D-S fusion, it yields a robust detection system that adapts to the evolving landscape of bot technology.
Methodology
My proposed methodology for social bot detection consists of five layers: data collection, feature extraction, model training, evidence representation, and evidence fusion. Each layer is designed to handle specific aspects of the detection process, culminating in a fused decision that leverages both user behavior sequences and profile metadata.
Data Collection Layer
The data collection layer involves gathering user profile data and historical post data from social media platforms. Profile data includes attributes such as username, avatar type, description, location, follower count, friend count, and statistics on original posts, retweets, and replies. Historical post data encompasses all user activities, including original content, retweets, and comments, recorded in chronological order. To ensure sufficient data for sequence analysis, I exclude users with fewer than 20 posts, as sparse sequences may not yield meaningful patterns. Additionally, I cap the number of posts per user at 1,000 to manage computational complexity. This data forms the foundation for extracting features that characterize both static and dynamic aspects of user behavior.
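To make the filtering step concrete, here is a minimal sketch, assuming a pandas DataFrame with one row per post and hypothetical `user_id` and `created_at` columns; it illustrates the thresholds above rather than the original pipeline.

```python
import pandas as pd

MIN_POSTS, MAX_POSTS = 20, 1000  # thresholds from the text

def filter_users(posts: pd.DataFrame) -> pd.DataFrame:
    """Keep users with at least MIN_POSTS posts and truncate each
    user's history to the MAX_POSTS most recent posts."""
    counts = posts.groupby("user_id")["user_id"].transform("size")
    kept = posts[counts >= MIN_POSTS]
    kept = kept.sort_values(["user_id", "created_at"])  # chronological per user
    return kept.groupby("user_id").tail(MAX_POSTS)
```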
Feature Extraction Layer
Feature extraction is performed separately for profile metadata and behavior sequences. For profile metadata, I compute 16 features spanning 18 dimensions, categorized into numerical and categorical types, as summarized in Table 1. Numerical features include raw counts (e.g., followers, posts) and derived metrics such as posting rates per unit time and average engagement (AvgHot), calculated as:
$$ \text{AvgHot} = \frac{\sum_{i=1}^{n} (c_i + r_i + l_i)}{\text{TO} + \text{TR}} $$
where $c_i$, $r_i$, and $l_i$ represent the comment, retweet, and like counts for post $i$, respectively, and TO and TR denote the total original and retweeted posts. Categorical features include avatar type (default, empty, or custom), presence of a description, and location enablement. These features provide a snapshot of user characteristics that are commonly used in bot detection.
| Category | Feature Name | Dimension |
|---|---|---|
| Numerical Features | Follower Count | 1 |
| | Friend Count | 1 |
| | Original Post Count | 1 |
| | Retweet Count | 1 |
| | Reply Count | 1 |
| | Original Posts per Unit Time (O1) | 1 |
| | Original Posts per Unit Time (O2) | 1 |
| | Retweets per Unit Time (R1) | 1 |
| | Retweets per Unit Time (R2) | 1 |
| | Replies per Unit Time (C1) | 1 |
| | Replies per Unit Time (C2) | 1 |
| | Average Hotness (AvgHot) | 1 |
| | Username Character Length | 1 |
| Categorical Features | Avatar Type | 3 |
| | Has Description | 1 |
| | Location Enabled | 1 |
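As an illustration of the AvgHot formula above, the following sketch computes it from per-post engagement counts; the dictionary keys and function name are hypothetical.

```python
def avg_hot(posts, total_original, total_retweets):
    """AvgHot: total comments + retweets + likes across a user's posts,
    divided by TO + TR (total original and retweeted posts)."""
    total_engagement = sum(p["comments"] + p["retweets"] + p["likes"] for p in posts)
    denom = total_original + total_retweets
    return total_engagement / denom if denom else 0.0

# Example: two posts with (comments, retweets, likes) engagement
posts = [{"comments": 3, "retweets": 1, "likes": 10},
         {"comments": 0, "retweets": 2, "likes": 4}]
print(avg_hot(posts, total_original=1, total_retweets=1))  # 10.0
```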
For behavior sequences, I encode each user's actions into a symbolic string that records the action type (O for original post, R for retweet, C for reply) and the time interval between consecutive actions. Time intervals are discretized into 10 granularity levels by duration, as shown in Table 2: an interval of less than 5 seconds is not encoded, intervals of 5 to 10 seconds are encoded as ‘1’, and so on, up to intervals exceeding one year, encoded as ‘9’. This produces sequences such as “O 1 C 3 R”, denoting an original post followed by a reply after 5–10 seconds and then a retweet after 1 minute–1 hour; a minimal encoding sketch follows Table 2. Such sequences capture temporal patterns that may indicate automation, such as rapid-fire posting or periodic activity.
| Symbol | Time Granularity |
|---|---|
| No symbol | Δt < 5 seconds |
| 1 | 5 seconds ≤ Δt < 10 seconds |
| 2 | 10 seconds ≤ Δt < 1 minute |
| 3 | 1 minute ≤ Δt < 1 hour |
| 4 | 1 hour ≤ Δt < 2 hours |
| 5 | 2 hours ≤ Δt < 1 day |
| 6 | 1 day ≤ Δt < 1 week |
| 7 | 1 week ≤ Δt < 1 month |
| 8 | 1 month ≤ Δt < 1 year |
| 9 | 1 year ≤ Δt |
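The following sketch implements the Table 2 encoding, assuming each action is a (type, unix_timestamp) pair with type in {"O", "R", "C"}; the names and input format are illustrative.

```python
from bisect import bisect_right

# Granularity boundaries from Table 2, in seconds (1 month ~ 30 days).
BOUNDS = [5, 10, 60, 3600, 7200, 86400, 604800, 2592000, 31536000]

def encode_sequence(actions):
    """Encode (type, timestamp) pairs as a digital-DNA string."""
    actions = sorted(actions, key=lambda a: a[1])  # chronological order
    symbols = [actions[0][0]]
    for (_, prev_t), (kind, t) in zip(actions, actions[1:]):
        gap = t - prev_t
        level = bisect_right(BOUNDS, gap)  # 0 means Δt < 5 s: no symbol
        if level > 0:
            symbols.append(str(level))
        symbols.append(kind)
    return " ".join(symbols)

print(encode_sequence([("O", 0), ("C", 7), ("R", 2000)]))  # "O 1 C 3 R"
```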
To extract features from these sequences, I apply the N-gram model with varying window sizes (e.g., N=5 to 10) to capture short-term dependencies. The N-gram process slides a window over the sequence, generating subsequences that are vectorized using CountVectorizer. Additionally, I engineer manual features such as the frequency of specific time granularities, the occurrence of burst sequences (e.g., multiple actions with no pauses), and the presence of regular patterns (e.g., posts at fixed intervals). This results in a 234-dimensional feature vector that encapsulates the dynamic behavior of users, providing a rich source of evidence for bot detection.
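A minimal sketch of the N-gram vectorization step, assuming encoded strings like those produced above; the toy sequences are illustrative, and the handcrafted features (granularity frequencies, bursts, regular patterns) are omitted for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer

sequences = ["O1C3R5O", "OOOO1O1O", "R3R3R3R3"]  # illustrative encoded users

# Character n-grams over the symbol strings, mirroring the N=5..10 windows.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(5, 10))
ngram_features = vectorizer.fit_transform(sequences)  # sparse count matrix
print(ngram_features.shape)  # (n_users, n_distinct_ngrams)
```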
Model Training Layer
In the model training layer, I train separate classifiers on the profile metadata features (feature_1) and the behavior sequence features (feature_2). To identify the optimal classifier, I evaluate six algorithms: XGBoost, Random Forest, Logistic Regression, Support Vector Machine (SVM), Recurrent Neural Network (RNN), and Convolutional Neural Network (CNN). For each algorithm, I tune hyperparameters via grid search and address class imbalance with SMOTE oversampling (k_neighbors=3) to limit overfitting. The models are trained on datasets with balanced (1:1) and imbalanced (1:6) ratios of bots to humans, split into training, validation, and test sets in an 8:1:1 ratio. Judged by accuracy (Acc), precision (Pre), recall, and F1-score, XGBoost consistently outperforms the other models for both feature types, as detailed in Table 3. I therefore select XGBoost as the primary classifier for generating evidence probabilities.
| Sample Ratio | Feature Type | Model | Acc | Pre | Recall | F1 |
|---|---|---|---|---|---|---|
| 1:1 | Profile | XGBoost | 69.80 | 69.38 | 69.80 | 69.56 |
| | | Random Forest | 68.17 | 68.82 | 68.37 | 68.23 |
| | | Logistic Regression | 63.82 | 65.68 | 63.82 | 62.71 |
| | Behavior | XGBoost | 69.20 | 69.20 | 69.20 | 69.20 |
| | | Random Forest | 68.80 | 67.81 | 67.81 | 67.80 |
| | | Logistic Regression | 68.37 | 68.93 | 68.37 | 68.03 |
| 1:6 | Profile | XGBoost | 78.76 | 61.95 | 65.04 | 64.19 |
| | | Random Forest | 77.79 | 61.62 | 62.30 | 61.81 |
| | | Logistic Regression | 74.35 | 56.64 | 50.19 | 53.22 |
| | Behavior | XGBoost | 80.30 | 59.51 | 59.35 | 59.43 |
| | | Random Forest | 66.72 | 57.79 | 61.78 | 59.71 |
| | | Logistic Regression | 68.86 | 57.68 | 65.77 | 56.41 |
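The training setup described above can be sketched as follows; the synthetic data and the XGBoost hyperparameters are placeholders, not the tuned values from my grid search.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-ins for the real feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(700, 18))     # e.g. the 18-dimensional profile features
y = rng.integers(0, 2, size=700)   # 0 = human, 1 = bot

# 8:1:1 split: 80% train, then halve the remainder into validation/test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

# Oversample only the training split so synthetic samples never leak
# into the validation or test data.
X_res, y_res = SMOTE(k_neighbors=3, random_state=42).fit_resample(X_train, y_train)

clf = XGBClassifier(n_estimators=300, max_depth=6, eval_metric="logloss")
clf.fit(X_res, y_res)
p_bot = clf.predict_proba(X_test)[:, 1]  # bot probability, used later as evidence
```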
Evidence Representation Layer
The evidence representation layer converts the predictions from the XGBoost classifiers into basic probability assignments (BPA) for D-S evidence theory. The frame of discernment is defined as $\Theta = \{0, 1\}$, where 0 represents “human” and 1 represents “bot”. For the profile metadata evidence, denoted as $m_1$, the BPA is derived from the classifier’s predicted probability $p_1$ for the bot class:
$$ m_1(A_j) = \begin{cases}
p_1 & \text{if } j = 1 \\
1 - p_1 & \text{if } j = 0
\end{cases} $$
Similarly, for the behavior sequence evidence $m_2$, with predicted probability $p_2$:
$$ m_2(A_j) = \begin{cases}
p_2 & \text{if } j = 1 \\
1 - p_2 & \text{if } j = 0
\end{cases} $$
These assignments capture the degree of belief from each evidence source, accounting for uncertainty in the classifications.
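For the binary frame used here, the BPA construction reduces to a pair of masses per classifier; a minimal sketch, with illustrative probabilities:

```python
def to_bpa(p_bot: float) -> dict:
    """Map a classifier's predicted bot probability to a BPA over {human, bot}."""
    return {"bot": p_bot, "human": 1.0 - p_bot}

m1 = to_bpa(0.82)  # e.g. profile-evidence classifier output p_1
m2 = to_bpa(0.35)  # e.g. behavior-evidence classifier output p_2
```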
Evidence Fusion Layer
In the evidence fusion layer, I combine $m_1$ and $m_2$ using Dempster’s rule of combination. The combined BPA for a proposition $A$ is given by:
$$ M(A) = (m_1 \oplus m_2)(A) = \frac{\sum_{x \cap y = A} m_1(x)\, m_2(y)}{1 - k} $$
where $k$ is the conflict measure calculated as:
$$ k = \sum_{x \cap y = \emptyset} m_1(x) m_2(y) $$
The denominator $1 - k$ normalizes the combined masses so that they sum to 1. If $k = 1$, the evidence is entirely conflicting and fusion is undefined. In practice, this rule reconciles discrepancies between the two evidence sources and sharpens the overall detection confidence. For example, if profile evidence strongly suggests a bot while behavior evidence indicates a human, the fusion weighs both to reduce misclassification. The final decision is made by thresholding the combined bot probability, with the threshold tuned to optimize performance metrics.
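A minimal sketch of Dempster's rule for this binary, singleton-only case (no mass is placed on the composite set $\Theta$, so the conflict $k$ is the cross mass on contradictory singletons); the numbers are illustrative.

```python
def dempster_combine(m1: dict, m2: dict) -> dict:
    """Fuse two BPAs over {human, bot} with Dempster's rule."""
    k = m1["bot"] * m2["human"] + m1["human"] * m2["bot"]  # conflict measure
    if k >= 1.0:
        raise ValueError("totally conflicting evidence; fusion undefined")
    return {"bot": m1["bot"] * m2["bot"] / (1.0 - k),
            "human": m1["human"] * m2["human"] / (1.0 - k)}

fused = dempster_combine({"bot": 0.82, "human": 0.18},
                         {"bot": 0.35, "human": 0.65})
# k = 0.596, fused["bot"] ≈ 0.71: after the conflict is renormalized,
# the stronger profile evidence outweighs the weaker behavior evidence.
```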
Experiments and Analysis
To evaluate the proposed framework, I conduct experiments on the TwiBot-22 dataset, a comprehensive benchmark for social bot detection. I extract a subset of 241,500 accounts, including 34,500 bots and 207,000 humans, representing an imbalanced ratio of 1:6. A balanced subset with 34,500 bots and an equal number of randomly selected humans is also created for comparison. Both subsets are split into training, validation, and test sets with an 8:1:1 ratio. The feature extraction process yields profile metadata features (feature_1) and behavior sequence features (feature_2), which are used to train XGBoost classifiers as established in the model training phase.
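For clarity, here is a sketch of the subset construction, using a small synthetic stand-in for the TwiBot-22 account table; the column names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
users = pd.DataFrame({"user_id": np.arange(7000),
                      "label": rng.choice([0, 1], size=7000, p=[6 / 7, 1 / 7])})

bots = users[users["label"] == 1]
humans = users[users["label"] == 0]
imbalanced = pd.concat([bots, humans])                                   # ~1:6
balanced = pd.concat([bots, humans.sample(len(bots), random_state=42)])  # 1:1
```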
The experimental results demonstrate the superiority of the multi-evidence fusion approach. On the balanced sample (1:1), the fused model achieves an accuracy of 72.66% and an F1-score of 72.65%, compared with 69.80% and 69.56% for profile evidence alone and 69.20% and 69.20% for behavior evidence alone, improvements of roughly 2.9 and 3.1 percentage points, respectively. On the imbalanced sample (1:6), the fused model attains an accuracy of 84.60% and an F1-score of 66.27%, outperforming profile evidence (78.76% accuracy, 64.19% F1) and behavior evidence (80.30% accuracy, 59.43% F1). The confusion matrices and ROC curves further illustrate the reduction in false positives and false negatives, validating the fusion mechanism's ability to correct misclassifications: in the balanced case, behavior evidence corrects false negatives made by profile evidence, while in the imbalanced case it mitigates false positives.
To assess stability, I perform five repeated experiments with different test splits and compute mean performance metrics with standard errors. As shown in Table 4, the fused model consistently achieves higher accuracy and F1-scores with lower variance, indicating robustness across samples. The ROC curves exhibit larger areas under the curve (AUC) for the fused model, confirming its enhanced discriminatory power.
| Sample Ratio | Model | Acc | Pre | Recall | F1 |
|---|---|---|---|---|---|
| 1:1 | Profile Evidence | 69.36 ± 0.78 | 68.96 ± 0.86 | 67.82 ± 0.82 | 68.92 ± 0.82 |
| | Behavior Evidence | 69.51 ± 0.43 | 68.32 ± 0.45 | 68.56 ± 0.44 | 69.41 ± 0.43 |
| | Fused Evidence | 72.27 ± 0.60 | 73.03 ± 0.51 | 71.63 ± 0.52 | 72.05 ± 0.67 |
| 1:6 | Profile Evidence | 79.71 ± 0.53 | 63.38 ± 0.23 | 67.78 ± 0.47 | 64.76 ± 0.13 |
| | Behavior Evidence | 80.20 ± 0.09 | 59.26 ± 0.06 | 59.08 ± 0.09 | 59.16 ± 0.06 |
| | Fused Evidence | 84.45 ± 0.35 | 66.25 ± 0.41 | 66.03 ± 0.34 | 66.08 ± 0.10 |
I further compare the fused model against state-of-the-art baselines on the TwiBot-22 dataset. For balanced samples, the fused model surpasses graph-based methods such as H2GCN (61.65% accuracy, 67.61% F1) and SIRAN (65.67% accuracy, 71.95% F1), demonstrating the value of behavior sequences as an evidence source. For imbalanced samples, it achieves competitive results, with an F1-score of 66.08% that outperforms BotBuster (52.26% F1) and BotSCL (61.53% F1) and closely matches recent social heterophily-based detection (65.97% F1). These comparisons underscore the advantage of multi-evidence fusion in handling class imbalance and adapting to the complexities of modern bot technology.
Conclusion
In this article, I have presented a multi-evidence fusion framework for social bot detection that integrates user behavior sequences and profile metadata using D-S evidence theory. The key contributions include the development of a behavior sequence encoding method that captures temporal dynamics often overlooked by traditional approaches, and the application of D-S theory to fuse evidence from multiple sources, thereby enhancing detection accuracy and stability. Experimental results on the TwiBot-22 dataset confirm that behavior sequences serve as a potent evidence source, effectively complementing profile metadata to reduce false positives and negatives. The fused model consistently outperforms single-evidence methods and state-of-the-art baselines, particularly in imbalanced scenarios, highlighting its practical utility in real-world applications.
The proposed framework offers several advantages: it relies on easily accessible data from user profiles and historical posts, avoiding the computational overhead of social graph analysis; it is modular and extensible, allowing additional evidence types such as text content or network features to be incorporated; and it leverages D-S theory to handle uncertainty and conflict, making it resilient to evolving bot technology. Future work will extend the framework to multi-modal evidence fusion, including text and visual data, and adapt it to emerging threats posed by generative AI. As social bots continue to evolve, the integration of behavioral sequence analysis and evidence fusion will be crucial for maintaining the integrity of online spaces and countering computational propaganda. This approach not only advances the field of social bot detection but also provides a scalable solution for platform regulators and policymakers seeking to mitigate the risks of automated influence operations.
