How to Find Hidden Pages on Websites
by Elizabeth Mott; Updated September 13, 2017
In 2016, Google handled over 3.2 trillion search queries, yet the results the search engine provided accounted for only a fraction of the available content online. Much of the information available online isn't accessible by search engines, so you need to use special tools, or investigate websites yourself, to find these hidden pages. Known as the deep web, this hidden information accounts for up to 5,000 times what's available using typical search techniques.
Types of Hidden Content
Websites' hidden pages fall into categories that describe why they remain invisible to search engines.
Some are dynamic content, served up only when a visitor issues a specific request on a website that uses database-driven code to present targeted results. As an example, these pages could include shopping results based on specific combinations of product criteria. Search engines are not designed to track and store the information held in these databases. To find these pages, you would have to go to the website and search for the specific information you are looking for, or use a database-oriented search service like Bright Planet.
Some pages don't have links that connect them to searchable sources. Temporary resources, such as multiple versions of under-development websites, can fall into this category, as can poorly designed websites. For example, if someone created a web page and uploaded it to the website's server, but failed to add a link to it on the website's current pages, no one would know it was there, including the search engines.
Still other pages, such as those on subscription sites, require log-in credentials to view or reach. Web designers can also designate pages and sections of sites as off limits to search engines, which keeps them from being found through conventional means. To access password-protected pages, you typically need to create an account first.
Using Robots.txt Files
Search engines crawl through the pages on a website and index its content so it can show up in response to queries. When a website owner wants to exclude some portions of her domain from these indexing procedures, she adds the addresses of these directories or pages to a special text file named robots.txt, stored at the root of her site. Because most websites include a robots file regardless of whether they add any exclusions to it, you can use the predictable name of the document to display its contents.
If you type "[domain name]/robots.txt" without the quotation marks into the location line of your browser, replacing "[domain name]" with the site address, the content of the robots file often appears in the browser window after you press the "Enter" key. Entries prefaced with "Disallow" represent parts of the site that the owner has asked search engines not to crawl, so they generally do not appear in search results. ("Nofollow" is a related instruction, but it appears in a page's own markup rather than in the robots file.)
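The same check can be sketched in a few lines of Python. This is only an illustration, not a complete robots parser: the function names (robots_url, disallowed_paths, fetch_robots) and the SAMPLE text are made up for this example, and the sketch ignores which crawler each rule applies to.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def robots_url(site):
    """Build the robots.txt address at the root of a site."""
    return urljoin(site, "/robots.txt")


def disallowed_paths(robots_text):
    """Return the paths named on Disallow lines, in order, without repeats."""
    paths = []
    for line in robots_text.splitlines():
        line = line.split("#", 1)[0].strip()      # drop comments and padding
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            value = value.strip()
            if value and value not in paths:      # an empty Disallow blocks nothing
                paths.append(value)
    return paths


def fetch_robots(site):
    """Download a site's robots file (raises an error if the request fails)."""
    with urlopen(robots_url(site), timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical robots file contents, used here instead of a live download.
SAMPLE = """\
User-agent: *
Disallow: /drafts/
Disallow: /private/report.html  # internal
Disallow:
"""

print(robots_url("https://example.com"))   # https://example.com/robots.txt
print(disallowed_paths(SAMPLE))            # ['/drafts/', '/private/report.html']
```

To try it against a real site, pass the text returned by fetch_robots to disallowed_paths. A missing robots file raises an error rather than returning an empty list.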
Do-It-Yourself Website Hacking
In addition to checking robots.txt files, you can often find otherwise hidden content by typing web addresses for specific pages and folders in your web browser. For example, if you were looking at an artist's website and noticed that each page used the same naming convention – like gallery1.html, gallery2.html, gallery4.html – then you may be able to find a hidden gallery by typing the page name "gallery3.html" in your web browser.
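Spotting the gaps in a numbered series can also be done mechanically. The sketch below uses a hypothetical helper, missing_pages, and assumes the page names end in a number followed by a file extension. It only suggests names to try; it does not check whether those pages exist, and it does not preserve zero padding such as "gallery03.html".

```python
import re


def missing_pages(filenames):
    """Return names that would fill the gaps in numbered series of pages."""
    pattern = re.compile(r"^(.*?)(\d+)(\.\w+)$")   # prefix, number, extension
    series = {}
    for name in filenames:
        match = pattern.match(name)
        if match:
            prefix, number, suffix = match.groups()
            series.setdefault((prefix, suffix), set()).add(int(number))

    guesses = []
    for (prefix, suffix), numbers in sorted(series.items()):
        # Walk from the lowest to the highest number seen and keep the gaps.
        for n in range(min(numbers), max(numbers) + 1):
            if n not in numbers:
                guesses.append(f"{prefix}{n}{suffix}")
    return guesses


print(missing_pages(["gallery1.html", "gallery2.html", "gallery4.html"]))
# ['gallery3.html']
```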
Similarly, if you see that the website uses folders to organize pages – like example.com/content/page1.html, with "/content" being the folder – then you may be able to view the folder itself by typing the website and folder, without a page, such as "example.com/content/" in your web browser. If access to the folder hasn't been disabled, then you may be able to navigate through the pages it contains, as well as pages in any sub-folders, to find hidden content.
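Working out which folder addresses to try from a page address is simple string handling. The following sketch assumes a hypothetical helper named folder_urls and an example address with one extra folder level. It only builds the addresses; whether a server shows a folder listing depends on how that server is configured.

```python
from urllib.parse import urlsplit, urlunsplit


def folder_urls(page_url):
    """List every folder address on the path of page_url, deepest first."""
    parts = urlsplit(page_url)
    segments = [s for s in parts.path.split("/") if s]
    if segments and not parts.path.endswith("/"):
        segments = segments[:-1]                  # drop the page name itself

    folders = []
    while segments:
        path = "/" + "/".join(segments) + "/"
        folders.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
        segments = segments[:-1]                  # move up one folder
    return folders


print(folder_urls("https://example.com/content/archive/page1.html"))
# ['https://example.com/content/archive/', 'https://example.com/content/']
```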