Detecting bots, trolls, and disinformation campaigns
How to deconstruct botnets, map troll farms, and trace the digital fingerprints of influence operations
Investigating Coordinated Inauthentic Behaviour (CIB)—the organised manipulation of online discourse by botnets and troll farms—is one of the most critical and complex tasks in modern journalism. This tutorial provides the OSINT framework to identify, analyse, and attribute these hidden influence operations.
1.1. The investigative need (The “why”)
State-sponsored disinformation networks, as well as those fuelled by commercial interests, use synthetic amplification to give false narratives the illusion of organic popularity. These operations deploy networks of bots (automated accounts) and trolls (human operators) that engage in coordinated activities such as simultaneous posting, topic hijacking, and targeted harassment.
Journalists must be able to distinguish fake grassroots activity (astroturfing) from genuine public opinion. Doing so helps reporters avoid publishing false premises, report accurately on foreign or domestic interference, and maintain the integrity of the public record.
1.2. Learning outcomes
By completing this tutorial, you will be able to:
Deconstruct Coordinated Inauthentic Behaviour (CIB) into its constituent parts (Actors, Content, Behaviour).
Formulate advanced search queries across platforms (especially X/Twitter and Reddit) to identify temporal and behavioural coordination.
Utilize network analysis principles (social network analysis, or SNA) to map amplifier accounts and key nodes.
Preserve collected digital evidence using Chain of Custody protocols, ensuring data is admissible and verifiable.
Ethically apply AI tools to summarize large datasets for rapid entity and relationship extraction.
1.3. Case study hook
Imagine a rapid, organized social media surge pushing a false narrative just ahead of a major election. Thousands of newly created accounts, all sharing similar language, propagate the narrative by posting at exactly the same moment. Your goal is not just to debunk the narrative, but to reveal the centralized source (a troll farm or political consultancy) that coordinated the campaign. This tutorial shows you how to find the digital fingerprints of that coordination.
💡 2. Foundational theory and ethical-legal framework
2.1. Key terminology
Coordinated Inauthentic Behaviour (CIB): A term popularized by Meta for manipulation in which multiple accounts or pages work together to deceive people about the identity, purpose, or origin of the entity behind them. Coordination and deception are the key components.
Astroturfing: The practice of masking the sponsors of a message or organization to make it appear as though it originates from and is supported by genuine grassroots participants (e.g., a fake public opinion campaign).
Botnet: A network of automated, often interconnected, accounts (software robots or bots) used to execute coordinated tasks like amplifying a hashtag, sharing links, or performing DDoS attacks.
Social Network Analysis (SNA): A process and set of tools used to visually map and analyze the relationships and flow of information between individual actors (nodes) in a network, revealing centralized amplifiers and coordinated sharing patterns.
Digital Signature: A unique, non-content-based pattern (e.g., common file metadata, shared IP addresses, identical account creation patterns, or specific custom URL shorteners) that links otherwise disparate accounts to a single operator.
⚠️ 2.2. Ethical and legal boundaries
2.2.1. Consent & privacy: The “stop at the login” rule
All intelligence gathered should be from publicly available sources (e.g., public social media posts, public databases).
Rule of thumb ("Stop at the Login"): Never attempt to bypass a login, paywall, or any other security measure. If the information requires unauthorized access, it is not OSINT; it is hacking, and you must stop immediately.
Privacy warning: When reporting on bot or troll networks, focus on the behaviour and the source of coordination, not on exposing the private data of possibly manipulated, unwitting, or low-level actors. Where ethically necessary, anonymize.
2.2.2. Legal considerations
The most significant legal risk in this field is the Computer Fraud and Abuse Act (CFAA) in the US, and similar legislation elsewhere, which criminalizes “unauthorized access” to computer systems.
Risk: Using automated scraping tools to bypass a platform’s Terms of Service (ToS) or rate limits, even for public data, can be interpreted as unauthorized access by platform legal teams.
Mandatory Disclaimer: Consult your organization’s legal department before beginning a large-scale data collection operation, especially if that work involves scraping or accessing data via a developer API. This tutorial provides methodological guidance only; it is not legal advice.
🛠️ 3. Applied methodology: Step-by-step practical implementation
3.1. Required tools & setup
The scenarios below draw on a small, standard toolkit, each item of which reappears later in this tutorial: X/Twitter's advanced search, Hunchly (or a comparable capture tool) for archiving, Gephi for network visualization, reverse image search services such as Yandex and TinEye, a command line with sha256sum for hashing, and an LLM (ideally locally run) for large-scale text analysis.
👷‍♀️ 3.2. Practical execution (The “how”)
The core investigative goal is to find accounts that exhibit coordinated and non-human behaviour.
Scenario 1: identifying temporal and content coordination on social media (X/Twitter)
Investigative Goal: Find a cluster of accounts that post the same controversial link or use the same niche hashtag within minutes of each other, suggesting automation or central direction.
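A hedged starting point using X/Twitter's standard search operators (verify the syntax against the live interface; the phrase and domain below are placeholders):

"the exact suspicious phrase" since:2025-06-01 until:2025-06-03 filter:links
url:example-shortener.com since:2025-06-01

Once the matching posts are exported, a quick grouping pass can flag minute-level coordination. A minimal sketch, assuming a CSV export with username, timestamp (ISO 8601), and url columns; adjust the field names to your actual export:

import csv
from collections import defaultdict
from datetime import datetime

THRESHOLD = 10  # distinct accounts sharing the same link within the same minute

buckets = defaultdict(set)
with open("search_results.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        ts = datetime.fromisoformat(row["timestamp"])  # e.g. 2025-06-01T08:00:00+00:00
        minute = ts.strftime("%Y-%m-%d %H:%M")  # truncate the timestamp to the minute
        buckets[(row["url"], minute)].add(row["username"])

for (url, minute), accounts in sorted(buckets.items()):
    if len(accounts) >= THRESHOLD:
        print(f"{minute}  {url}  {len(accounts)} accounts")

Accounts that recur across several flagged minute buckets are strong candidates for the network analysis in Scenario 2.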
Scenario 2: Revealing bot characteristics via network analysis
Feed your initial query results (a list of 100+ accounts posting the same link) into a Social Network Analysis (SNA) tool like Gephi.
Extract Data: Use an approved API or a specialized, ToS-compliant collection tool to pull the last 200 tweets for each of the 100 suspicious accounts. (Twint, once a popular option, is no longer maintained; prefer the official API.)
Model the Network: Import this data into Gephi. Create a network where:
Nodes are the accounts.
Edges (connections) are mentions, reposts, or replies.
Analyze Metrics: Look for key SNA metrics:
Degree Centrality: Nodes with an unusually high number of connections (often the central “botmaster” or key amplifier).
Modularity: Distinct, tight clusters of accounts that talk to each other but not to the wider network (suggesting a segmented botnet group).
Behavioural Audit: Check the accounts in the high-centrality clusters for classic bot features: high post frequency (e.g., 50+ posts/day), jumbled alphanumeric usernames (e.g., user7492931), and missing profile pictures or bios. A Python prototyping sketch follows this list.
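Before (or alongside) a full Gephi session, you can prototype the same graph and metrics in Python with networkx. A minimal sketch, assuming a tweets.csv export with hypothetical username and mentions columns (mentions as a semicolon-separated list of handles):

import csv
import re
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.DiGraph()
with open("tweets.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        for target in filter(None, row["mentions"].split(";")):
            G.add_edge(row["username"], target)  # edge = author mentions target

# Degree centrality: unusually well-connected nodes are amplifier candidates
for node, score in sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1])[:10]:
    print(f"{node}: {score:.3f}")

# Modularity-based communities: tight clusters that rarely talk outward
communities = greedy_modularity_communities(G.to_undirected())
print(f"{len(communities)} communities; largest holds {len(communities[0])} accounts")

# Behavioural red flag: jumbled alphanumeric usernames such as user7492931
print([n for n in G.nodes if re.fullmatch(r"[a-z]+\d{5,}", n)])

# Export the graph for visual inspection in Gephi
nx.write_gexf(G, "network.gexf")

Gephi remains the better choice for visual exploration; a script like this is for triage and for recording reproducible numbers in your Chain of Custody log.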
💾 3.3. Data preservation and Chain of Custody
You must assume all collected data will be challenged. A strict Chain of Custody (CoC) is therefore mandatory.
Collect and archive: Use Hunchly (or a similar tool) to capture every webpage, tweet, profile, and search results page. This preserves the original context, date/time, and URL.
Generate hash values: For every key file (e.g., a spreadsheet of scraped accounts, a screenshot, or the Hunchly/web archive file itself), immediately calculate and record its SHA-256 cryptographic hash value. This mathematical fingerprint proves the file has not been altered since the moment of collection.
Command Line Example (Linux/macOS):
sha256sum [filename.csv] >> CoC_Log.txt
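Because sha256sum writes each line as hash-then-filename, the same tool can later re-verify the files, provided CoC_Log.txt contains only hash lines:
sha256sum -c CoC_Log.txt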
Maintain a log: Create a running, chronological Chain of Custody Log document. This log should detail:
The date and UTC time of collection.
The Investigator (your name/ID).
The Method used (e.g., X Advanced Search, Manual Screenshot).
The Location/URL of the collected data.
The SHA-256 Hash of the resulting file.
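A single (entirely hypothetical) entry might read:
2025-10-14 09:42 UTC | J. Doe (ID 042) | X Advanced Search, manual export | https://x.com/search?q=%23ExampleTag | 3f8a1c…e290 (SHA-256, truncated here for display; record the full hash)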
🧠 4. Verification, analysis, and editorial integration
4.1. Corroboration strategy
Technical data (e.g., a cluster of accounts posting at 08:00:00 UTC) is insufficient on its own. It requires corroboration with at least two independent sources/methods.
Method 1 (Technical corroboration): Run the identified key amplifier accounts’ profile pictures through a reverse image search (such as Yandex or TinEye) to determine whether each image is a stock photo or has appeared on other, now-suspended bot accounts.
Method 2 (Behavioural corroboration): Use the account creation dates of the bot cluster. If hundreds of accounts were created on the same day or week, cross-reference this date with a known national/global political event to suggest a timeline of orchestration.
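A quick way to surface such a spike, assuming a CSV of the suspicious accounts with a hypothetical created_at column (ISO 8601):

import csv
from collections import Counter
from datetime import datetime

weeks = Counter()
with open("accounts.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        created = datetime.fromisoformat(row["created_at"])
        weeks[created.strftime("%G-W%V")] += 1  # bucket by ISO year-week

for week, count in weeks.most_common(5):
    print(f"{week}: {count} accounts created")

If one week accounts for the bulk of the registrations and coincides with a known political trigger, you have a second, independent line of evidence.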
4.2. Translating data to narrative
The technical data must be translated into clear, verifiable journalistic facts. For example, “a cluster of accounts posting at 08:00:00 UTC” becomes, in print, a sourced statement that hundreds of recently created accounts published identical text within the same one-minute window.
🤖 4.3. AI Assistance in analysis and ethical use
AI/LLMs can be powerful, ethical accelerators for processing the overwhelming volume of data common in disinformation investigations, but they should be used with extreme caution.
Summarizing large documents or log files: Feed an LLM a log file containing thousands of posts and ask for salient themes, repeated phrases, and emotional shifts (sentiment analysis). Prompt Example: “Analyze the attached log of 5,000 tweets. Group the posts by theme, identify the five most common keywords (excluding stopwords), and calculate the percentage of posts exhibiting negative sentiment towards [Target Entity].”
Identifying key entities, dates, and relationships: Use an LLM for Named Entity Recognition (NER) on large text corpora (e.g., thousands of inauthentic blog comments) to quickly extract every unique name, organization, and date mentioned, which can then be fed into a more traditional analysis tool (a local alternative is sketched after this list).
Translation of foreign language material: LLMs are excellent for rapid translation of large volumes of foreign language social media posts or propaganda materials.
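If source protection rules out a hosted LLM, a local NER library is a reasonable substitute. A minimal spaCy sketch (assumes pip install spacy and the en_core_web_sm model; for very large corpora, stream chunks through nlp.pipe instead):

import spacy

nlp = spacy.load("en_core_web_sm")
text = open("comments.txt", encoding="utf-8").read()

# Collect unique people, organizations, and dates for downstream clustering
entities = {(ent.text, ent.label_) for ent in nlp(text).ents
            if ent.label_ in {"PERSON", "ORG", "DATE"}}
for entity_text, label in sorted(entities):
    print(f"{label:7} {entity_text}")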
⚠️ AI warning: Hallucination and privacy
Hallucination risk: Never publish or present an AI-generated finding as a fact until it has been human fact-checked and verified against the original source data. LLMs are designed to generate plausible text, including false “facts.”
Privacy risk: Do not submit sensitive or non-public data, such as source-provided logs, non-public chat screenshots, or the names of confidential sources, to public LLM services like ChatGPT or Gemini. These services can use your input to train their models, potentially violating source protection and privacy. Use licensed, private, or locally run models for sensitive data.
🚀 5. Practice and resources
5.1. Practice exercise
Challenge: Investigate a potential astroturfing campaign targeting a controversial piece of proposed legislation (e.g., The Fictional Digital Policy Act).
Baseline search: Search X/Twitter for the hashtag #FictionalDigitalPolicyAct and the words “scam” or “threat” over the last week.
Filter for anomaly: Identify accounts that:
Have posted about the hashtag at least 10 times in the last 24 hours.
Were created in the last six months (check the join date on the profile; note that the since:2025-06-01 search operator limits results by post date, not by account age).
Have a follower count below 50.
Cross-platform check: Take a suspicious phrase used by the cluster (e.g., “The bill is a freedom killer!”) and run it as a search query on Reddit to see if the same phrase is being injected into political subreddits.
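The same check can be scripted against Reddit's public JSON search endpoint. A sketch (respect Reddit's API rules; the User-Agent string is a placeholder you should customize):

import requests

phrase = '"The bill is a freedom killer!"'
resp = requests.get(
    "https://www.reddit.com/search.json",
    params={"q": phrase, "sort": "new", "limit": 25},
    headers={"User-Agent": "research-script/0.1 (contact: you@example.org)"},
    timeout=30,
)
resp.raise_for_status()

for child in resp.json()["data"]["children"]:
    post = child["data"]
    print(f"r/{post['subreddit']}  u/{post['author']}  {post['title'][:60]}")

Identical phrasing surfacing in unrelated subreddits within the same window strengthens the coordination hypothesis.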
5.2. Advanced resources
The OSINT Framework: A comprehensive, categorized directory of OSINT tools and resources (used to find niche social media or domain analysis tools).
DFRLab (Atlantic Council): Case studies and methodologies from the Digital Forensic Research Lab for tracking influence operations.
SNA visualization tools (e.g., Gephi): Essential for visualizing and quantifying network structures beyond simple lists of accounts.
GHDB (Google Hacking Database): A repository of advanced search queries (Google Dorks) to uncover accidentally exposed files and data.
Open-Source data scraping libraries (e.g., Python’s BeautifulSoup/Scrapy): Used ethically and compliant with ToS/rate limits for scaled collection of public data.
6. Key takeaways and investigative principles
Prioritize behaviour over content: Focus your investigation on how a message is spread (coordination, timing, rate) rather than just the message’s content.
Always capture, hash, and log: Treat every piece of collected public data as potential evidence. Utilize your dedicated CoC log and generate SHA-256 hashes immediately.
Trace the digital signature: Look for technical commonalities that link disparate accounts (e.g., the same profile photo used elsewhere, identical post timing, shared infrastructure).
Cross-reference in threes: Never conclude based on a single piece of evidence. Corroborate technical findings with behavioural and network analysis data.
“Stop at the Login”: Maintain a strict ethical boundary to ensure your investigation is legal and your findings are admissible and defensible.
👁️ Coming next week…
Tracing corporate ownership: Beneficial & ultimate parent companies
Navigating global corporate registries (e.g., OpenCorporates). Understanding shell companies and identifying beneficial ownership to follow the money trail back to the ultimate, real-world actors behind corporate structures.