Advanced search engine alchemy: Exposing hidden data with dorking, Shodan, and Censys
Unlocking files, servers, and sensitive data hidden in plain sight
Investigative discoveries rarely appear on the first page of search results. Advanced search tactics, sometimes known as “dorking,” matter because they cut through the noise of ordinary search results, which are often dominated by commercial content.
They help journalists find and expose files, database dumps, sensitive system configurations, and misconfigured assets that were accidentally made public and are still indexed by search engines.
Used this way, a regular search engine becomes a powerful forensic tool that can surface key evidence for stories about financial fraud, government mistakes, and cybersecurity breaches.
👁️ Goals for Learning
When you’re done with this tutorial, you’ll be able to:
Use complex Boolean operators and Google Dorks to find specific file types and server information.
Use Google Custom Search Engines (CSEs) and platform-specific syntax to search social media and niche platforms like Telegram.
Use Shodan and Censys to explore the open internet, identify an organisation’s digital footprint, and find services available to the public.
Set up a chain of custody for digital evidence to make sure it stays safe and can be used in court.
Use AI technologies responsibly to look at big datasets while lowering the chance of hallucination.
💡 Basic Ideas
Important Words
Google Dorking:
Using advanced search operators (such as filetype: and inurl:) to find specific, often sensitive, information that search engines have indexed and made publicly searchable.
Google CSE (Custom Search Engine):
A search engine that uses Google’s index but only searches a certain set of domains (for example, just .gov sites or a selected list of Telegram channels).
Passive OSINT:
Gathering publicly available information without directly interacting with the target system (for example, using Google Dorks, Shodan, or CSEs). This is the preferred and safest approach.
Shodan/Censys:
These are specialised search engines that scan and index internet-connected devices, including the Internet of Things (IoT). They report open ports, services, banners, and hardware and software versions.
Chain of Custody:
A documented process that tracks digital evidence from collection to final presentation, showing that it is authentic and hasn’t been altered.
⚠️ Important for Journalists: Moral and Legal Limits
Privacy and Consent
The “Stop at the Login” Rule is the most important rule for ethical OSINT: Don’t try to get around any password, login prompt, or access protection. Publicly indexed content is visible to anyone, and using a dork to find it is generally lawful. Trying to log in to, scan, or probe a system directly to gain access is unethical and, in many jurisdictions, illegal. Even when sensitive data has been made public, handle it with the utmost care for the privacy of people who are not public figures.
Things to think about legally
Querying the data that Shodan and Censys have already collected is acceptable. Scanning or port sweeping a target’s network yourself without consent is different: it could break anti-hacking or computer fraud laws (such as the CFAA in the U.S.). Don’t do anything that could be considered unauthorised access or a denial-of-service attack. Use these tools only to query the data they have already collected.
Disclaimer: Before looking into any system vulnerabilities or disclosed non-public data, you should talk to your news organisation’s legal counsel.
🛠️ The Method: Putting It into Action Step by Step
Tools and Setup Needed
Dedicated Virtual Machine (VM) or Sandbox Browser: For all of your OSINT work, use a separate environment, either a Linux VM or a dedicated browser profile. This keeps your main computer safe and lowers the chance that your identity will be revealed.
VPN/Tor: To reduce the exposure of your identity and location while searching, use a trusted commercial VPN or the Tor Browser.
Hashing Tool: A simple program that makes SHA-256 hashes, like built-in command-line utilities or a separate hash generator.
Web Archiving Tool: Use services like The Wayback Machine or specialist capture tools like Hunchly or PageFreezer to preserve evidence.
Accounts: Free or paid Shodan and Censys accounts give you access to their data on devices and servers.
👷‍♀️ Practical execution (The “How”)
Scenario 1: Auditing a Company’s Exposed Internal Documents (Advanced Google Dorking)
Investigative Goal: Locate publicly available PDF, Excel, or PowerPoint files on a target company’s site that contain the keywords “budget,” “confidential,” or “Q3 earnings.”
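One way to assemble a query for this goal is sketched below. It is only an illustration: example.com is a placeholder domain, and the extension list is an assumption (older .xls and .ppt formats may be worth adding). The operators used are site:, filetype:, OR, and quoted phrases.

```python
def build_dork(domain, filetypes, keywords):
    """Compose a query limited to one site, several file types, and any of the keywords."""
    types = " OR ".join(f"filetype:{ext}" for ext in filetypes)
    # Quote each keyword so multi-word phrases are matched exactly.
    terms = " OR ".join(f'"{kw}"' for kw in keywords)
    return f"site:{domain} ({types}) ({terms})"


query = build_dork(
    "example.com",  # hypothetical target domain
    ["pdf", "xlsx", "pptx"],
    ["budget", "confidential", "Q3 earnings"],
)
print(query)
```

Paste the printed string into the search box and record it exactly as run, since the precise query belongs in the evidence log.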
Scenario 2: Using Google CSEs and Platform Syntax to Mine Social and Niche Platforms
A Google Custom Search Engine (CSE) is an invaluable tool for journalists, as it allows you to apply the power of Google’s operators to a hand-picked list of sites (like a list of activist blogs, a nation’s government websites, or, famously, a list of indexed public social channels).
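A CSE keeps its site list in its own configuration, but the same idea can be approximated in an ordinary query by OR-ing several site: operators. The sketch below does that for a small, illustrative list of Telegram web domains and turns the query into a standard Google search URL. The phrase and the helper names are made up for the example.

```python
from urllib.parse import quote_plus


def multi_site_query(sites, phrase):
    """Restrict an exact-phrase search to a hand-picked list of sites."""
    scope = " OR ".join(f"site:{site}" for site in sites)
    return f'({scope}) "{phrase}"'


def google_search_url(query):
    """URL-encode the query so it can be opened directly in a browser."""
    return "https://www.google.com/search?q=" + quote_plus(query)


# Illustrative inputs: two Telegram web domains and a made-up phrase.
query = multi_site_query(["t.me", "telegram.me"], "port strike")
print(query)
print(google_search_url(query))
```

Only channels and posts that Google has already indexed will appear, so an empty result does not prove the content does not exist.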
Scenario 3: Checking the Security Posture of a Target (Shodan and Censys)
These tools work like “search engines for the internet of things,” giving you a quick look at an organisation’s services that are open to the public.
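As a rough sketch, a Shodan query can be built by combining filters in name:value form. The example below uses the org, port, and product filters. The organisation name is hypothetical, and which filters you can use may depend on your account. Nothing here touches the target: the resulting string is meant to be pasted into Shodan’s own search box, which only queries data Shodan has already collected.

```python
def shodan_query(filters):
    """Join filter/value pairs into a single Shodan search string."""
    parts = []
    for name, value in filters.items():
        text = str(value)
        # Values containing spaces must be quoted to stay attached to their filter.
        if " " in text:
            text = f'"{text}"'
        parts.append(f"{name}:{text}")
    return " ".join(parts)


# Hypothetical target: MongoDB's default port at a made-up organisation.
query = shodan_query({"org": "Example Corp", "port": 27017, "product": "MongoDB"})
print(query)
```

Treat any host returned as a lead only. Do not connect to the service yourself.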
💾 The Audit Trail: Keeping Data Safe
Establishing a Chain of Custody is not optional; it distinguishes an unproven lead from evidence that can be used in court.
Capture the Artefact: Don’t rely on screenshots of the search results page alone. Save the page or download the file. To capture the whole page (including the HTML source and headers) together with a recorded timestamp, use a tool like Hunchly or PageFreezer.
Make a Hash: As soon as you download a file (such as a PDF or spreadsheet) or capture a webpage, generate a SHA-256 cryptographic hash of it. This fixed-length hexadecimal string acts as a fingerprint: if the hash still matches later, the file has not been altered. Changing even a single character in the file produces a completely different hash.
Windows (PowerShell): Get-FileHash -Algorithm SHA256 C:\path\to\file.pdf
Linux and macOS: shasum -a 256 /path/to/file.pdf (on Linux, sha256sum /path/to/file.pdf also works)
Log the Evidence: Record the following in a secure, access-controlled spreadsheet or database, and never overwrite earlier entries:
Date and Time of Collection (in UTC): YYYY-MM-DD HH:MM:SS UTC
Query: The precise query string used in the search.
Source URL: The complete URL of the artefact that was collected.
SHA-256 Hash: The hash that was made.
Collector: Your name and investigator ID.
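The hashing and logging steps above can be sketched in a few lines of Python using only the standard library. The column names and the CSV format are assumptions made for this illustration, and a CSV file is not tamper-proof on its own: store it somewhere access-controlled and only ever append to it.

```python
import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path

# Assumed column names, mirroring the log fields listed above.
FIELDS = ["collected_utc", "query", "source_url", "sha256", "collector"]


def sha256_of(path):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_evidence(log_path, artefact_path, query, source_url, collector):
    """Append one evidence record to a CSV log and return it."""
    row = {
        "collected_utc": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC"),
        "query": query,
        "source_url": source_url,
        "sha256": sha256_of(artefact_path),
        "collector": collector,
    }
    log = Path(log_path)
    is_new = not log.exists()
    with log.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()  # write the header only once
        writer.writerow(row)
    return row
```

Calling log_evidence("custody_log.csv", "report.pdf", the query string, the source URL, and your name or ID right after each download keeps the hash and the timestamp tied to the moment of collection. The file names here are placeholders.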
🧠 Check and Analyse for Reporting
Strategy for Corroboration
The first search result is a lead, not a fact. Always check technical results against independent sources:
IP Address: Before attributing an IP address found by Shodan (for example, one hosting a vulnerable database) to the target business, confirm that it belongs to or is associated with that business using Whois records, DNS lookups, or Certificate Transparency logs (like those in Censys).
File Content: You need to check the primary claims and financial numbers in a file you found using Dorking against at least two additional independent sources, such as a public financial disclosure, a company statement, or a separate whistleblower document.
Geolocation: If a dork surfaces a social media post with a photo, use reverse image search, shadow/sun analysis, and comparison with satellite imagery (like Google Earth) to confirm that the photo is authentic and was taken where it claims to be.
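For the IP address check, one small offline step is to test whether an address falls inside network ranges you have already tied to the organisation through Whois records. The sketch below assumes you have collected those ranges by hand. The addresses shown are reserved documentation ranges, not real allocations, and the function name is made up for the example.

```python
import ipaddress


def ip_in_known_ranges(ip, cidr_blocks):
    """Return the known CIDR blocks (from your own Whois research) that contain the IP."""
    address = ipaddress.ip_address(ip)
    return [block for block in cidr_blocks if address in ipaddress.ip_network(block)]


# Placeholder ranges standing in for blocks registered to the target.
known_blocks = ["198.51.100.0/24", "203.0.113.0/24"]

print(ip_in_known_ranges("203.0.113.45", known_blocks))  # inside the second block
print(ip_in_known_ranges("192.0.2.1", known_blocks))     # inside neither block
```

A match is one piece of corroboration rather than proof. Hosting providers and shared infrastructure mean an address can sit in a range without being operated by the organisation, so DNS and certificate records still need checking.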
Connecting Data to Story
Raw technical data means little to most readers. You need to turn it into plain, verifiable facts.
🤖 AI Assistance in Analysis
AI/LLMs can be very helpful for analysts, but they should never be used as a primary source or as a fact-checker.
Summarising Big Documents: Upload a publicly indexed 500-page PDF (such as a policy or legal filing) and tell the LLM to find “all mentions of dates, people, and financial values over $1M.” This can save many hours of manual review.
Finding Important People and Grouping Them: Give an LLM a cleaned-up list of 1,000 email addresses you got through dorking and ask it to group them by department (for example, finance@, hr@, support@) to figure out how the company is set up. Email addresses are personal data, so do this only with a model that meets the privacy conditions in the warning below.
Translation: Use AI to quickly translate documents written in other languages, such as a Russian log file or an Arabic contract, so the investigation can proceed smoothly.
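For simple cases, the email grouping task can also be done locally without sending any addresses to a model. The sketch below sorts addresses by role keywords found in the part before the @ sign. The keyword map is a hypothetical starting point and would need adjusting for each organisation.

```python
import re
from collections import defaultdict

# Hypothetical keyword map; extend or change it per organisation.
ROLE_KEYWORDS = {
    "finance": {"finance", "accounts", "billing"},
    "hr": {"hr", "careers", "recruitment"},
    "support": {"support", "help", "helpdesk"},
}


def group_by_department(addresses):
    """Group addresses by the first role keyword found in the local part."""
    groups = defaultdict(list)
    for raw in addresses:
        address = raw.strip().lower()
        local = address.partition("@")[0]
        # Split on common separators so "hr" matches "hr-team" but not "chris".
        tokens = set(re.split(r"[._+\-]", local))
        department = next(
            (name for name, words in ROLE_KEYWORDS.items() if tokens & words),
            "unclassified",
        )
        groups[department].append(address)
    return dict(groups)


sample = ["Finance@example.com", "hr-team@example.com", "chris@example.com"]
print(group_by_department(sample))
```

Addresses that land in "unclassified" are usually personal mailboxes, and any department you infer for them still needs checking against an independent source.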
⚠️ IMPORTANT WARNING: The risk of hallucination and privacy
Hallucination Risk (NEVER Trust): LLMs are designed to generate language that sounds believable, not to verify facts. They can make up sources, dates, and even whole documents. Every piece of data, number, date, or entity name provided by the AI must be checked by a person against the original source document. If the AI says the date is “March 15, 2025,” open the source file and confirm that the date appears there.
Privacy Warning (NEVER Upload): Don’t ever upload sensitive, source-provided, or non-public data (such as PII) to an AI/LLM service that is open to the public. Public services may log your inputs or use them to train their models, which risks leaking the data and endangering your source. For this kind of material, use only self-hosted models or vetted enterprise services whose policy guarantees no logging and no training on your data.
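A first-pass filter can support the human check described in the hallucination warning, though it cannot replace it. The sketch below assumes the source document has already been converted to plain text on your own machine. It flags any AI-supplied value that does not appear verbatim in that text. The function name and sample strings are made up for the example.

```python
def unverified_claims(claims, source_text):
    """Return AI-supplied strings that do not appear verbatim in the source text."""
    # Collapse whitespace and ignore case so line breaks do not hide a match.
    haystack = " ".join(source_text.split()).lower()
    return [
        claim for claim in claims
        if " ".join(claim.split()).lower() not in haystack
    ]


source = "The contract was signed on March 15,\n2025 by Example Corp."
ai_output = ["March 15, 2025", "$2.4 million", "Example Corp"]

print(unverified_claims(ai_output, source))
```

Anything returned needs immediate scrutiny. Anything not returned has only been shown to exist in the text, so a person still has to read the passage and confirm the AI used it in the right context.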
🚀 Next Steps and Practice
Exercise to Practise
The Policy That Leaks:
Goal: Find public documents on a big, well-known university’s website that discuss “COVID-19 policy” and include a staff member’s full email address.
Query: Use the site:, filetype:, and intext: operators to search for PDF files on a big university domain (like site:harvard.edu) that contain both “COVID-19 policy” and an address on that domain (like “@harvard.edu”).
Task: Download any document you find and generate its SHA-256 hash as the first entry in your Chain of Custody log.
📖 More advanced resources
Google Hacking Database (GHDB): A curated, categorised list of Google Dorks that have surfaced sensitive files and system weaknesses.
Spotlight’s OSINT Toolkit (EBU members only): Tools and links to bespoke CSEs (like the Telegago CSE) for searching on social media and other platforms.
Censys Search Documentation: This is where you can discover the powerful, SQL-like query language that lets you look at internet hosts and certificates in great detail.
Hunchly: A commercial OSINT collection and reporting tool that automates preservation, hashing, and audit trails.
✅ Important things to remember and investigative principles
Precision is power: Advanced operators (site:, filetype:, inurl:, intitle:) change search engines from libraries into tools for surgical intelligence.
Always look for a wall: If you see a login or paywall, stop. You can only use publicly available information in your research. Don’t try to get around access limits.
Document everything (CoC): To keep a verifiable Chain of Custody for every important artefact, you need to record the Query, the URL, the Timestamp, and the SHA-256 Hash.
AI is an assistant, not a source: Use LLMs to summarise data, translate it, and extract entities. Do not trust facts made by AI or give private information to a public model.
Translating technical data: A discovery like “Open MongoDB port” needs to be turned into a clear journalistic fact, such as “Unauthenticated customer database.”
👁️ Coming next week…
📸 Forensic image analysis: EXIF data & error level analysis (ELA)
Infrastructure intelligence is far more persuasive when it is backed by visual proof. The next stage is to learn how to use pictures as digital evidence. We will show you how to extract EXIF data (location, camera, date) and how to use advanced methods like Error Level Analysis (ELA) to detect image manipulation, photo forgeries, and re-saves. This is very important for checking user-generated content in a crisis.