The Invisible Web
by Ernest Sasso, Esquire
Because of the enormous size and complexity of the World Wide Web, a vast number of Web pages are not indexed by Web search engines and therefore cannot be retrieved by searching those engines. These resources include a wide variety of content, encompassing the enormously rich databases of information from companies, universities, government agencies, and other organizations that the Web’s search engines can’t (or won’t) reach, and thus can’t include in their results. There is nothing mysterious or mythical about the concept of the invisible Web: “invisible” merely means “invisible to search engines.” A lot of it is covered by the following resources. When searching the invisible Web, plan to locate the category of material you want, then browse. Don’t be too specific in your searches.
Overview of Invisible Web Sites
Direct Search — Search Tools and Directories Excellent, frequently updated, academically-oriented site, which can be either searched or browsed. http://www.freepint.com/gary/direct.htm
CompletePlanet Site claims “103,000 searchable databases and specialty search engines.” Contains many useful resources, along with trivial, mundane, but hard-to-find-anywhere-else sites, such as individual pages (e.g., news articles) and company catalogs. http://completeplanet.com
FirstGov The official gateway to U.S. government information on the Web, with connections to more than 51 million pages on more than 20,000 federal, state, territorial, and tribal sites. This invaluable tool pulls together a substantial part of the Invisible Web: federal government databases. http://firstgov.gov
GPO Access Provides free online use of over 1500 Federal databases, going beyond what is available on FirstGov. http://www.gpo.gov/su_docs
INFOMINE: Scholarly Internet Resource Collections Academically-oriented, covering lots of resources, with annotations and searching and browsing capabilities. Make sure to click on a broad-category database listed at the bottom of the screen if you want to browse. http://infomine.ucr.edu/search.phtml
A Collection of Special Search Engines Academically-oriented. Tons of links, fairly well organized by subject. From Leiden University, in The Netherlands. http://www.leidenuniv.nl/ub/biv/specials.htm
Digital Librarian: a Librarian’s Choice of the Best of the Web Personal choices by a librarian of interesting sites in a number of categories. Doesn’t try to cover everything in all areas, and its subject categories are non-traditional at times, but it’s a lot of fun to browse and often provides a quick intro to searching in a particular area. http://www.digital-librarian.com
Specialized vs. Invisible Web
There are many specialized search directories on the Web that share characteristics of an Invisible Web site but are perfectly visible to the search engines. These sites are often structured as hierarchical directories, designed as navigation hubs for specific topics or categories of information, and usually offer both sophisticated search tools and the ability to browse a structured directory. But even if these sites consist of hundreds, or even thousands, of HTML pages, many aren’t part of the Invisible Web, since search engine spiders generally have no problem finding and retrieving the pages. In fact, these sites typically have an extensive internal link structure that makes the spider’s job even easier.
Many sites that claim to have large collections of invisible or “deep” Web content actually include many specialized search services that are perfectly visible to search spiders. They make the mistake of equating a sophisticated search mechanism with invisibility.
How can you tell the difference between a specialized vs. Invisible Web resource? Always start by browsing the directory, not searching. Search programs, by their nature, use scripts, and often return results that contain indirect URLs. This does not mean, however, that the site is part of the Invisible Web. It’s simply a byproduct of how some search tools function.
As you begin to browse the directory, click on category links and drill down to a destination URL that leads away from the directory itself. As you’re clicking, examine the links. Do they appear to be direct or indirect URLs? Do you see the telltale signs of a script being executed? If so, the page is part of the Invisible Web, even if the destination URLs have no question marks. Why? Because crawlers wouldn’t have followed the links to the destination URLs in the first place.
But if, as you drill down the directory structure, you notice that all of the links are direct URLs, the site is almost certainly part of the visible Web, and can be crawled and indexed by search engines.
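The question-mark and script “telltale” heuristics described above can be sketched in code. This is a minimal illustrative sketch, not a definitive test: the list of script markers below (cgi-bin, .cgi, .asp, .php, .jsp) is an assumption for the example, and real crawler policies vary from engine to engine.

```python
from urllib.parse import urlparse

# Path fragments that suggest a URL is generated by a script
# (an "indirect" URL in the terminology above). Illustrative
# assumptions only, not an exhaustive list.
SCRIPT_MARKERS = ("cgi-bin", ".cgi", ".asp", ".php", ".jsp")

def looks_indirect(url: str) -> bool:
    """Heuristic: does this URL look script-generated?"""
    parsed = urlparse(url)
    if parsed.query:  # a '?' query string is the classic telltale sign
        return True
    return any(marker in parsed.path.lower() for marker in SCRIPT_MARKERS)

# A directory whose links all look direct is probably visible Web;
# any indirect link may mark an Invisible Web boundary.
links = [
    "http://example.org/history/index.html",
    "http://example.org/cgi-bin/fetch?doc=42",
]
for url in links:
    print(url, "->", "indirect" if looks_indirect(url) else "direct")
```

Note that, as the text explains, this heuristic can only flag the boundary; a page with a clean, direct-looking URL may still be invisible if crawlers never followed the indirect links leading to it.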
Visible vs. Invisible
The Gateway to Educational Materials vs. AskERIC
The Gateway to Educational Materials
The Gateway to Educational Materials Project is a directory of collections of high-quality educational resources for teachers, parents, and others involved in education. The Gateway features annotated links to more than 12,000 education resources.
- Structure: Searchable directory, part of the Visible Web. Browsing the categories reveals all links are direct URLs. Although the Gateway’s search tool returns indirect URLs, the direct URLs of the directory structure and the resulting offsite links provide clear linkages for search engine spiders to follow.
AskERIC
AskERIC allows you to search the ERIC database, the world’s largest source of education information. ERIC contains more than one million citations and abstracts of documents and journal articles on education research and practice.
- Structure: Database; limited browsing of small subsets of the database available. These limited browsable subsets use direct URLs; the rest of the ERIC database is only accessible via the AskERIC search interface, making the contents of the database effectively invisible to search engines.
Very important point: Some of the content in the ERIC database also exists in the form of plain HTML files; for example, articles published in the ERIC Digest. This illustrates one of the apparent paradoxes of the Invisible Web. Just because a document is located in an Invisible Web database doesn’t mean there aren’t other copies of the document elsewhere on visible Web sites. The key point is that the database containing the original content is the authoritative source, and searching the database will provide the highest probability of retrieving a document. Relying on a general-purpose search engine to find documents that may have copies on visible Web sites is unreliable.
INTA Trademark Checklist vs. Delphion Intellectual Property Network
INTA Trademark Checklist
The International Trademark Association (INTA) Trademark Checklist is designed to assist authors, writers, journalists/editors, proofreaders, and fact checkers with proper trademark usage. It includes listings for nearly 3,000 registered trademarks and service marks with their generic terms, and indicates capitalization and punctuation.
- Structure: Simple HTML pages, broken into five extensively cross-linked pages of alphabetical listings. The flat structure of the pages combined with the extensive cross-linking makes these pages extremely visible to the search engines.
Delphion Intellectual Property Network
The Delphion Intellectual Property Network allows you to search for, view, and analyze patent documents and many other types of intellectual property records. It provides free access to a wide variety of data collections and patent information, including United States patents, European patents and patent applications, PCT application data from the World Intellectual Property Office, Patent Abstracts of Japan, and more.
- Structure: Relational database; browsable, but links are indirect and rely on scripts to access information from the database. Data contained in the Delphion Intellectual Property Network database is almost completely invisible to Web search engines.
Key point: Patent searching and analysis is a very complex process. The tools provided by the Delphion Intellectual Property Network are finely tuned to help patent researchers home in on only the most relevant information pertaining to their search, excluding all else. Search engines are simply inappropriate tools for searching this kind of information. In addition, new patents are issued weekly or even daily, and the Delphion Intellectual Property Network is constantly refreshed. Search engines, with gaps of a month or more between recrawls of Web sites, couldn’t possibly keep up with this flood of new information.
Hoover’s vs. Thomas Register of American Manufacturers
Hoover’s Online
Hoover’s Online offers in-depth information for businesses about companies, industries, people, and products. It features detailed profiles of hundreds of public and private companies.
- Structure: Browsable directory with powerful search engine. All pages on the site are simple HTML; all links are direct (though the URLs appear complex). Note: some portions of Hoover’s are only available to subscribers who pay for premium content.
Thomas Register of American Manufacturers
Thomas Register features profiles of more than 155,000 American and Canadian companies. The directory also allows searching by brand name, product headings, and even some supplier catalogs. As an added bonus, material on the Thomas Register site is updated constantly, rather than on the fixed update schedule of the printed version.
- Structure: Database access only. Further, access to the search tool is available to registered users only. This combination of database-only access for registered users puts the Thomas Register squarely in the universe of the Invisible Web.
The Library of Congress Web Site: Both Visible and Invisible
The U.S. Library of Congress is the largest library in the world, so it’s fitting that its site is also one of the largest on the Web. The site provides a treasure trove of resources for the searcher. In fact, it’s hard to even call it a single site, since several parts have their own domains or sub-domains.
The Library’s home page (http://www.loc.gov/) has a simple, elegant design with links to the major sections of the site. Mousing over the links to all of the sections reveals only one link that might be invisible: the link to the America’s Library site.
If you follow the link to the American Memory collection, you see a screen that allows you to access more than 80 collections featured on the site. Some of the links, such as those to “Today in History” and the “Learning Page,” are direct URLs that branch to simple HTML pages. However, if you select the “Collection Finder,” you’re presented with a directory-type menu for all of the topics in the collection. Each of the links on this page is not only indirect but also contains a large amount of information used to create new dynamic pages. However, once those pages are created, they include mostly direct links to simple HTML pages.
The point of this exercise is to demonstrate that even though the ultimate content available at the American Memory collection consists of content that is crawlable, following the links from the home page leads to a “barrier” in the form of indirect URLs on the Collection Finder directory page. Because they generally don’t crawl indirect URLs, most crawlers would simply stop spidering once they encounter those links, even though they lead to perfectly acceptable content.
Though this makes much of the material in the American Memory collection technically invisible, it’s also probable that someone outside of the Library of Congress has found the content and linked to it, allowing crawlers to access the material despite the apparent roadblocks. In other words, any Web author who likes content deep within the American Memory collection is free to link to it, and if crawlers find those links on the linking author’s page, the material may ultimately be crawled, even if the crawler couldn’t access it through the “front door.” Unfortunately, there’s no quick way to confirm that content deep within a major site like the Library of Congress has been crawled in this manner, so the searcher should use the Library’s own internal search and directory services to be assured of getting the best possible results.
Why Use the Invisible Web?
General-purpose search engines and directories are easy to use, and respond rapidly to information queries. Because they are so accessible and seemingly all-powerful, it’s tempting to simply fire up your favorite Web search engine, punch in a few keywords that are relevant to your search, and hope for the best. But the general-purpose search engines are essentially mass audience resources, designed to provide something for everyone. Invisible Web resources tend to be more focused, and often provide better results for many information needs. Consider how a publication like Newsweek would cover a story on Boeing compared to an aviation industry trade magazine such as Aviation Week and Space Technology or how a general news magazine like Time would cover a story on currency trades vs. a business magazine like Forbes or Fortune.
In making the decision whether or not to use an Invisible Web resource, it helps to consider the point of view of both the searcher and the provider of a search resource. The goal for any searcher is relatively simple: to satisfy an information need in a timely manner. Of course, providers of search resources also strive to satisfy the information needs of their users, but they face other issues that complicate the equation. For example, there are always conflicts between speed and accuracy. Searchers demand fast results, but if a search engine has a large, comprehensive index, returning results quickly may not allow for a thorough search of the database.
For general-purpose search engines, there’s a constant tension between finding the correct answer, the best answer, and the easiest answer. Because they try to satisfy virtually any information need, general-purpose search engines resolve these conflicts by making compromises. It costs a significant amount of money to crawl the Web, index pages, and handle search queries. The bottom line is that general-purpose search engines are in business to make a profit, a goal that often works against the mission of providing comprehensive results for searchers with a wide variety of information needs.
On the other hand, governments, academic institutions, and other organizations that aren’t constrained by a profit-making motive operate many Invisible Web resources. They don’t feel the same pressures to be everything to everybody. And they can often afford to build comprehensive search resources that allow searchers to perform exhaustive research within a specific subject area, and keep up-to-date and current.
Why select an Invisible Web resource over a general-purpose search engine or Web directory? Here are several good reasons:
Specialized content focus = more comprehensive results. Invisible Web resources tend to be focused on specific subject areas. This is particularly true of the many databases made available by government agencies and academic institutions. Your search results from these resources will be more comprehensive than those from most visible Web resources for two reasons. First, there are generally no limits imposed by databases on how quickly a search must be completed, or if there are, you can generally select your own time limit that will be reached before a search is cut off. This means that you have a much better chance of having all relevant results returned, rather than just those results that were found fastest.
Second, individuals who go to the trouble of creating a database-driven information resource generally try to make the resource as comprehensive as possible, including as many relevant documents as they are able to find. This is in stark contrast to general-purpose search engine crawlers, which often arbitrarily limit the depth of crawl for a particular Web site. With a database, there is no depth-of-crawl issue: all documents in the database will be searched by default.
Specialized search interface = more control over search input and output. Let’s assume that everything on the Web could be located and accessed via a general search tool like Google or HotBot. How easy and efficient would it be to use one of these general-purpose engines when a specialized tool was available? Would you begin a search for a person’s phone number with a search of an encyclopedia? Of course not. Likewise, even if the general-purpose search engines suddenly provided the capability to find specialized information, they still couldn’t compete with search services specifically designed to find and easily retrieve specialized information. Put differently, searching with a general-purpose search engine is like using a shotgun, whereas searching with an Invisible Web resource is more akin to taking a highly precise rifle-shot approach.
As an added bonus, most databases provide customized search fields that are subject-specific. History databases will allow limiting searches to particular eras, for example, and biology databases by species or genomic parameters. Invisible Web databases also often provide extensive control over how results are formatted. Would you like documents to be sorted by relevance, by date, by author, or by some other criteria of your own choosing? Contrast this flexibility with the general-purpose search engines, where what you see is what you get.
Increased precision and recall. Consider two informal measures of search engine performance: recall and precision. Recall represents the total number of relevant documents retrieved in response to a search query, divided by the total number of relevant documents in the search engine’s entire index. One hundred percent recall means that the search engine was able to retrieve every document in its index that was relevant to the search terms. Measuring recall alone isn’t sufficient, however, since the engine could always achieve 100 percent recall simply by returning every document in its index.
Recall is balanced by precision. Precision is the number of relevant documents retrieved divided by the total number of documents retrieved. If 100 pages are found, and only 20 are relevant, the precision is (20/100), or 20 percent. Relevance, unfortunately, is strictly a subjective measure. The searcher ultimately determines relevance after fully examining a document and deciding whether it meets the information need.
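The two measures can be made concrete with a short calculation. The precision figures below reuse the example from the text (100 documents retrieved, 20 of them relevant); the total of 40 relevant documents in the index, used for recall, is a hypothetical figure added for illustration.

```python
def precision(relevant_retrieved: int, total_retrieved: int) -> float:
    """Fraction of the retrieved documents that are relevant."""
    return relevant_retrieved / total_retrieved

def recall(relevant_retrieved: int, total_relevant: int) -> float:
    """Fraction of all relevant documents in the index that were retrieved."""
    return relevant_retrieved / total_relevant

# Example from the text: 100 pages found, only 20 relevant.
print(f"precision = {precision(20, 100):.0%}")  # 20%

# Hypothetical: suppose the index holds 40 relevant documents in all.
print(f"recall = {recall(20, 40):.0%}")  # 50%
```

Note the degenerate case the text warns about: returning the entire index drives recall to 100 percent while precision collapses, which is why the two measures must be considered together.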
To maximize potential relevance, search engines strive to maximize recall and precision simultaneously. In practice, this is difficult to achieve. As the size of a search engine index increases, there are likely to be more relevant documents for any given query, leading to a higher recall percentage. As recall increases, precision tends to decrease, making it harder for the searcher to locate relevant documents.
Because they are often limited to specific topics or subjects, many Invisible Web and specialized search services offer greater precision even while increasing total recall. Narrowing the domain of information means there is less extraneous or irrelevant information for the search engine to process. Because Invisible Web resources tend to have smaller databases, recall can be high while still offering a great deal of precision, leading to the best of all possible worlds: higher relevance and greater value to the searcher.
Invisible Web resources = highest level of authority. Institutions or organizations that have a legitimate claim on being an unquestioned authority on a particular subject maintain many Invisible Web resources. Unlike with many sites on the visible Web, it’s relatively easy to determine the authority of most Invisible Web sites. Most offer detailed information about the credentials of the people responsible for maintaining the resource. Others feature awards, citations, or other symbols of recognition from other acknowledged subject authorities. Many Invisible Web resources are produced by book or journal publishers with sterling reputations among libraries and scholars.
The answer may not be available elsewhere. The explosive growth of the Web, combined with the relative ease of finding many things online, has led to the widely held but widely inaccurate belief that “if it’s not on the Web, it’s not online.” There are a number of reasons this belief simply isn’t true. For one thing, there are vast amounts of information available exclusively via Invisible Web resources. Much of this information is in databases, which can’t be directly accessed by search engines, but it is definitely online and often freely available.
When to Use the Invisible Web
It’s not always easy to know when to use an Invisible Web resource as opposed to a general search tool. As you become more familiar with the landscape of the Invisible Web, there are several rules of thumb you can use when deciding whether to use an Invisible Web resource.
When you’re familiar with a subject. If you know a particular subject well, you’ve likely already discovered one or more Invisible Web resources that offer the kind of information you need. Familiarity with a subject also offers another advantage: knowledge of which search terms will find the “best” results in a particular search resource, as well as methods for locating new resources.
When you’re familiar with specific search tools. Some Invisible Web resources cover multiple subjects, but since they often offer sophisticated interfaces you’ll still likely get better results from them compared to general-purpose search tools. Restricting your search through the use of limiters, Boolean logic, or other advanced search functions generally makes it easier to pull a needle from a haystack.
When you’re looking for a precise answer. When you’re looking for a simple answer to a question, the last thing you want is a list of hundreds of possible results. Yet an abundance of potential answers is what you’ll end up with if you use a general-purpose search engine, and you’ll have to spend time scanning the result list to find what you need. Many Invisible Web resources are designed to perform what are essentially lookup functions, for when you need a particular fact, phone number, name, bibliographic record, and so on.
When you want authoritative, exhaustive results. General-purpose search engines will never be able to return the kind of authoritative, comprehensive results that Invisible Web resources can. Limited depth of crawl, stale content, and the lack of selective filtering fill any result list from a general-purpose engine with a certain amount of noise. And, because the haystack of the Web is so huge, a certain number of authoritative documents will inevitably be overlooked.
When timeliness of content is an issue. Invisible Web resources are often more up-to-date than general-purpose search engines and directories.
Top 20 Invisible Web Categories
Public Company Filings. The U.S. Securities and Exchange Commission (SEC) and regulators of equity markets in many other countries require publicly traded companies to file certain documents on a regular schedule or whenever an event may have a material effect on the company. These documents are available in a number of locations, including company Web sites. While many of these filings may be visible and findable by a general-purpose search engine, a number of Invisible Web services have built comprehensive databases incorporating this information. FreeEDGAR (http://www.freedgar.com) and 10K Wizard (http://www.10kwizard.com) are examples of services that offer sophisticated searching and limiting tools as well as the assurance that the database is truly comprehensive. Some also offer free e-mail alert services to notify you that the companies you choose to monitor have just filed reports.
Telephone Numbers. Just as telephone white pages serve as the quickest and most authoritative offline resource for locating telephone numbers, a number of Invisible Web services exist solely to find telephone numbers. InfoSpace (http://www.infospace.com), Switchboard.com (http://www.switchboard.com), and AnyWho (http://www.anywho.com) offer additional capabilities like reverse-number lookup or correlating a phone number with an e-mail address. Because these databases vary in currency, it is often important to search more than one to obtain the most current information.
Customized Maps and Driving Directions. While some search engines, like Northern Light, have a certain amount of geographical “awareness” built in, none can actually generate a map of a particular street address and its surrounding neighborhood. Nor do they have the capability to take a starting and ending address and generate detailed driving directions, including exact distances between landmarks and estimated driving time. Invisible Web resources such as Mapblast (http://www.mapblast.com) and Mapquest (http://www.mapquest.com), as well as the newer Yahoo Local Maps (http://maps.yahoo.com/beta), Google Local (http://maps.google.com/), and Windows Live Local (http://local.live.com), are designed specifically to provide these interactive services.
Clinical Trials. Clinical trials by their very nature generate reams of data, most of which is stored from the outset in databases. For the researcher, sites like the New Medicines in Development (http://pharma.org/searchcures/newmeds/webdb) database are essential. For patients searching for clinical trials to participate in, ClinicalTrials.gov (http://www.clinicaltrials.gov) and Center Watch’s (http://www.centerwatch.com) Clinical Trials Listing Service are invaluable.
Patents. Thoroughness and accuracy are absolutely critical to the patent searcher. Major business decisions involving significant expense or potential litigation often hinge on the details of a patent search, so using a general-purpose search engine for this type of search is effectively out of the question. Many government patent offices maintain Web sites, but Delphion’s Intellectual Property Network (http://www.delphion.com/) allows full-text searching of U.S. and European patents and abstracts of Japanese patents simultaneously. Additionally, the United States Patent Office (http://www.uspto.gov) provides patent information dating back to 1790, as well as U.S. Trademark data.
Books. The growth of the Web has proved to be a boon for bibliophiles. Countless out-of-print booksellers have established Web sites, obliterating the geographical constraints that formerly limited their business to local customers. Simply having a Web presence, however, isn’t enough. Depth-of-crawl problems, combined with continually changing inventory, make catalog pages from used booksellers obsolete or inaccurate even if they do appear in the result list of a general-purpose search engine. Fortunately, sites like Alibris (http://www.alibris.com) and Bibliofind (http://www.bibliofind.com) allow targeted searching over hundreds of specialty and used bookseller sites.
Library Catalogs. There are thousands of Online Public Access Catalogs (OPACs) available on the Web, from national libraries like the U.S. Library of Congress and the Bibliothèque Nationale de France to academic libraries, local public libraries, and many other important archives and repositories. OPACs allow searches for books in a library by author, title, subject keywords, or call number, often providing other advanced search capabilities. WebCATS, Library Catalogs on the World Wide Web (http://www.libdex.com/), is an excellent directory of OPACs around the world. OPACs are great tools to verify the title or author of a book.
Authoritative Dictionaries. Need a word definition? Go directly to an authoritative online dictionary. Merriam-Webster’s Collegiate (http://www.m-w.com) and the Cambridge International Dictionary of English (http://dictionary.cambridge.org/) are good general dictionaries. Scores of specialized dictionaries also provide definitions of terms from fields ranging from aerospace to zoology. Some Invisible Web dictionary resources even provide metasearch capability, checking for definitions in hundreds of online dictionaries simultaneously. OneLook (http://www.onelook.com) is a good example.
Environmental Information. Need to know who’s a major polluter in your neighborhood? Want details on a specific country’s position on the Kyoto Treaty? Try the Envirofacts multiple database search (http://www.epa.gov/enviro/).
Historical Stock Quotes. Many people consider stock quotes to be ephemeral data, useful only for making decisions at a specific point in time. Stock market historians and technical analysts, however, can use historical data to compile charts of trends that some even claim to have a certain amount of predictive value. There are numerous resources available that contain this information, including BigCharts.com (http://www.bigcharts.com/historical/).
Historical Documents and Images. General-purpose search engines don’t handle images well. This can be a problem with historical documents, too, as many historical documents exist on the Web only as scanned images of the original. The U.S. Library of Congress American Memory Project (http://memory.loc.gov) is a wonderful example of a continually expanding digital collection of historical documents and images. The American Memory Project also illustrates that some data in a collection may be “visible” while other portions are “invisible.”
Economic Information. Governments and government agencies employ entire armies of statisticians to monitor the pulse of economic conditions. This data is often available online, but rarely in a form visible to most search engines. Recon-Regional Economic Conditions (http://www2.fdic.gov/recon/) is an interactive database from the Federal Deposit Insurance Corporation that illustrates this point.
Award Winners. Who won the Nobel Peace Prize in 1938? You might be able to learn that it was Viscount Cecil of Chelwood (Lord Edgar Algernon Robert Gascoyne Cecil) via a general-purpose search engine, but the Nobel e-Museum (http://www.nobel.se/) site will provide the definitive answer. Other Invisible Web databases have definitive information on winners of major awards ranging from the Oscars (http://www.oscars.org/awards_db/) to the Peabody Awards (http://www.peabody.uga.edu/recipients/search.html).
Translation Tools. Web-based translation services are not search tools in their own right, but they provide a valuable service when a search has turned up documents in a language you don’t understand. Translation tools accept a URL, fetch the underlying page, translate it into the desired language, and deliver it as a dynamic document. AltaVista (http://world.altavista.com/) provides such a service. Note that these tools have many limitations and frequently produce imperfect translations; while far from perfect, they will continue to improve with time.
Postal Codes. Even though e-mail is rapidly overtaking snail mail as the world’s preferred method of communication, we all continue to rely on the postal service from time to time. Many postal authorities such as the Royal Mail in the United Kingdom (http://www.royalmail.com/quick_tools/postcodes/default.htm) provide postal code look-up tools.
Basic Demographic Information. Demographic information from the U.S. Census and other sources can be a boon to marketers or anyone needing details about specific communities. One of the many excellent starting points is the American FactFinder (http://factfinder.census.gov/). The utility this site provides is almost endless.
Interactive School Finders. Before the Web, finding the right university or graduate school often meant a trek to the library and hours scanning course catalogs. Now it’s easy to locate a school that meets specific criteria for academic programs, location, tuition costs, and many other variables. Peterson’s GradChannel (http://iiswinprd01.petersons.com/GradChannel/) is an excellent example of this type of search resource for students, offered by a respected provider of school selection data.
Campaign Financing Information. Who's really buying (or stealing) the election? Now you can find out by accessing the actual forms filed by anyone contributing to a major campaign. The Federal Election Commission provides several databases (http://www.fec.gov/), while a private concern called FECInfo (http://www.fecinfo.com) "massages" government-provided data for greater utility. FECInfo has a great deal of free material available in addition to several fee-based resources. Many states are also making this type of data available.
Weather Data. If you don't trust your local weatherperson, try an Invisible Web resource like AccuWeather (http://www.accuweather.com). This extensive resource offers more than 43,000 U.S. 5-day forecasts, international forecasts, local NEXRAD Doppler radar images, customizable personal pages, and fee-based premium services. Weather information clearly illustrates the vast amount of real-time data available on the Internet that the general search tools do not crawl. Another favorite is Automated Weather Source (http://aws.com/globalwx.html). This site allows you to view local weather conditions in real time via instruments placed at various sites (often located at schools) around the country.
Art Gallery Holdings. From major national exhibitions to small co-ops run by artists, countless galleries are digitizing their holdings and putting them online. An excellent way to find these collections is to use ADAM, the Art, Design, Architecture & Media Information Gateway (http://adam.ac.uk/). ADAM is a searchable catalogue of more than 2,500 Internet resources, all of whose entries are visible.
What’s NOT on the Web, Visible or Invisible
There’s an entire class of information that’s simply not available on the Web, including the following:
Proprietary databases and information services. These include Thomson’s Dialog service, LexisNexis, and Dow Jones, which restrict access to their information systems to paid subscribers.
Many governmental and public records. Although the U.S. government is the most prolific publisher of content both on the Web and in print, there are still major gaps in online coverage. Some proprietary services such as KnowX (http://www.knowx.com) offer limited access to public records for a fee. Coverage of government and public records is similarly spotty in other countries around the world. While there is a definite trend toward moving government information and public records online, the sheer mass of information means that not all of it will ever go online. There are also privacy concerns that may prevent certain types of public records from going digital in a form that might compromise an individual’s rights.
Scholarly journals or other “expensive” information. Thanks in part to the “publish or perish” imperative at modern universities, publishers of scholarly journals or other information that’s viewed as invaluable for certain professions have succeeded in creating a virtual “lock” on the market for their information products. It’s a very profitable business for these publishers, and they wield an enormous amount of control over what information is published and how it’s distributed. Despite ongoing, increasingly acrimonious struggles with information users, especially libraries, who often have insufficient funding to acquire all of the resources they need, publishers of premium content see little need to change the status quo. As such it’s highly unlikely that this type of content will be widely available on the Web any time soon.
There are some exceptions. Northern Light’s Special Collection, for example, makes available a wide array of reasonably priced content that previously was available only via expensive subscriptions or site licenses from proprietary information services. ResearchIndex can retrieve copies of scholarly papers posted on researchers’ personal Web sites, bypassing the “official” versions appearing in scholarly journals. But this type of semi-subversive, “Napster-like” service may come under attack in the future, so it’s too early to tell whether it will provide a viable alternative to the official publications. For the near future, public libraries are one of the best sources for this information, made available to community patrons and paid for by tax dollars.
Full Text of All Newspapers and Magazines. Very few newspapers or magazines offer full-text archives. For those publications that do, the content goes back only a limited time, 10 or 20 years at most. There are several reasons for this. Publishers are well aware that the content they have published often retains value over time, but few economic models have yet emerged that allow publishers to unlock that value. Authors’ rights are another concern. Many authors retained most re-use rights to the materials printed in magazines and newspapers, and for content published more than two decades ago, reprints in digital format were not envisioned or legally accounted for. It will take time for publishers and authors to forge new agreements, and for consumers of Web content to become comfortable with the notion that not everything on the Web is free. New micropayment systems or “all you can eat” subscription services should emerge that will remove some of the current barriers keeping magazine and newspaper content off the Web. Some newspapers are already placing archives of their content on the Web; often the search function is free but retrieval of full text is fee-based, as with the services offered by NewsLibrary, at http://www.newslibrary.com.
And finally, perhaps the reason users cannot find what they are looking for on either the visible or Invisible Web is simply that it’s not there. While much of the world’s print information has migrated to the Web, there are, and always will be, millions of documents that will never be placed online. The only way to locate these printed materials will be via traditional methods: using libraries or asking for help from people who have physical access to the information.
Spider Traps, Lies, and Other Chicanery
Though there are many technical reasons the major search engines don’t index the Invisible Web, there are also “social” reasons having to do with the validity, authority, and quality of online information. Because the Web is open to everybody and anybody, a good deal of its content is published by non-experts or, even worse, by people with a strong bias that they seek to conceal from readers. Search engines must also cope with unethical Web page authors who seek to subvert their indexes with millions of bogus “spam” pages. Most of the major engines have developed strict guidelines for dealing with spam, which sometimes have the unfortunate effect of excluding legitimate content.
No matter whether you’re searching the visible or Invisible Web, it’s important always to maintain a critical view of the information you’re accessing. For some reason, people often lower their guard when it comes to information on the Internet. People who would scoff if asked to participate in an offline chain-mail scheme cast common sense to the wind and willingly forward hoax e-mails to their entire address books. Urban legends and all manner of preposterous stories abound on the Web.
Here are some important questions to ask and techniques to use for assessing the validity and quality of online information, regardless of its source.
Who Maintains the Content? The first question to ask of any Web site is who’s responsible for creating and updating it. Just as you would with any offline source of information, you want to be sure that the author and publisher are credible and the information they are providing can be trusted.
Corporate Web sites should provide plenty of information about the company and its products and services. But corporate sites will always seek to portray the company in the best possible light, so you’ll need to use other information sources to balance this favorable bias. If you’re unfamiliar with a company, try searching for information about it using Hoover’s. For many companies, AltaVista provides a link to a page with additional “facts about” the company, including a capsule overview, news, details of Web domains owned, and financial information.
Information maintained by government Web sites or academic institutions is inherently more trustworthy than other types of Web content, but it’s still important to look at things like the authority of the institution or author. This is especially true in the case of academic institutions, which often make server space available to students who may publish anything they like without worrying about its validity.
If you’re reading a page created by an individual, who is the author? Do they provide credentials or some other kind of proof that they write with authority? Is contact information provided, or is the author hiding behind the veil of anonymity? If you can’t identify the author or maintainer of the content, it’s probably not a good idea to trust the resource, even if it appears to be of high quality in all other respects.
What Is the Content Provider’s Authority? Authority is a measure of reputation. When you’re looking at a Web site, is the author or producer of the content a familiar name? If not, what does the site provide to assert authority?
For an individual author, look for a biography of the author citing previous work or awards, a link to a resume or other vita that demonstrates experience, or similar relevant facts that prove the author has authority. Sites maintained by companies should provide a corporate profile, and some information about the editorial standards used to select or commission work.
Some search engines provide an easy way to check on the authority of an author or company. Google, for example, tries to identify authorities by examining the link structure of the entire Web, gauging how often a page is cited, in the form of a link, by other Web page authors. It also checks whether there are links to these pages from “important” sites on the Web that have authority themselves. Results in Google for a particular query thus provide an informal gauge of authority. Beware, though, that this gauge is only informal: even a page created by a Nobel laureate may not rank highly on Google if other important pages on the Web don’t link to it.
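The link-counting idea described above can be illustrated with a toy version of a link-based ranking calculation. This is only a sketch of the general technique, not Google's actual algorithm, and the four "pages" and their links are invented for the example.

```python
# A simplified link-based authority score (PageRank-style power
# iteration). The pages and link graph below are invented for
# illustration; real engines use far more elaborate versions.

def rank_pages(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its score evenly
                for p in pages:
                    new[p] += damping * scores[page] / n
            else:
                share = damping * scores[page] / len(outlinks)
                for target in outlinks:
                    new[target] += share
        scores = new
    return scores

web = {
    "laureate.example/paper": [],   # an expert's page nobody links to
    "portal.example": ["news.example", "blog.example"],
    "news.example": ["portal.example"],
    "blog.example": ["portal.example"],
}
scores = rank_pages(web)
```

Run on this tiny graph, the heavily linked portal outranks the unlinked expert's page regardless of how authoritative its author is, which is exactly the caveat noted above.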
Is There Bias? Bias can be subtle, and can be easily camouflaged in sites that deal with seemingly non-controversial subjects. Bias is easy to spot when it takes the form of a one-sided argument. It’s harder to recognize when it dons a Janusian mask of two-sided “argument” in which one side consistently (and seemingly reasonably) prevails. Bias is particularly insidious on so-called “news” sites that exist mainly to promote specific issues or agendas. The key to avoiding bias is to look for balanced writing.
Another form of bias on the Web appears when a page seems objective but is sponsored by a group or organization with a hidden agenda that may not be apparent on the site. It’s particularly important to look for this kind of thing on health or consumer product information sites. Some large companies fund information resources for specific health conditions, or advocate a particular lifestyle that incorporates a particular product. While the companies may not exert direct editorial influence over the content, content creators nonetheless can’t help but be aware of their patronage, and may not be as objective as they might be. On the other side of the coin, the Web is a powerful medium for activist groups with an agenda against a particular company or industry. Many of these groups have set up what appear to be objective Web sites presenting seemingly balanced information, when in fact they are extremely one-sided and biased.
There’s no need to be paranoid about bias. In fact, recognizing bias can be very useful in helping understand an issue in depth from a particular point of view. The key is to acknowledge the bias and take steps to filter, balance, and otherwise gain perspective on what is likely to be a complex issue.
Examine the URL. URLs can contain a lot of useful clues about the validity and authority of a site. Does the URL seem “appropriate” for the content? Most companies, for example, use their name or a close approximation in their primary URL. A page stored on a free service like Yahoo’s GeoCities or Lycos-Terra’s Tripod is not likely to be an official company Web site. URLs can also reveal bias.
Deceptive page authors can also feed search engine spiders bogus content using cloaking techniques, but once you’ve actually retrieved a page in your browser, its URL cannot be spoofed. If a URL contains words that seem suspicious or irrelevant to the topic it supposedly represents, it’s likely a spurious source of information.
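The URL checks described above can be made systematic. The sketch below, using Python's standard urllib.parse module, pulls out the clues mentioned in this section: who controls the host, its top-level domain, and whether the page lives on a free hosting service. The example URLs and the short list of free hosts are invented for illustration, not an exhaustive test.

```python
from urllib.parse import urlparse

def url_clues(url):
    """Extract the parts of a URL most useful for judging a source:
    the host (who controls it), the top-level domain, and whether
    the page lives on a free hosting service."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    tld = host.rsplit(".", 1)[-1]
    # Illustrative fragments only; a real checker would use a fuller list.
    free_hosts = ("geocities.", "tripod.")
    return {
        "host": host,
        "tld": tld,
        "free_hosting": any(h in host for h in free_hosts),
    }

# A company page on a free host is unlikely to be the official site.
personal = url_clues("http://www.geocities.com/someuser/acme-corp.html")
# A .gov host carries the authority of the issuing agency.
official = url_clues("http://www.census.gov/some/report.html")
```

This is the programmatic equivalent of reading the URL by eye: the host tells you who publishes the page, and the domain suffix hints at the kind of institution behind it.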
Examine Outbound Links. The hyperlinks included in a document can also provide clues about the integrity of the information on the page. Hyperlinks were originally created to help authors cite references, and can provide a sort of online “footnote” capability. Does a page link to other credible sources of information? Or are most of the links to other internal content on a Web site?
Well-balanced sites have a good mix of internal and external links. For complex or controversial issues, external links are particularly important. If they point to other authorities on a subject, they allow you to easily access alternative points of view from other authors. If they point to less credible authors, or ones that share the same point of view as the author, you can be reasonably certain you’ve uncovered bias, whether subtle or blatant.
Is the Information Current? Currency of information is not always important, but for timely news, events, or for subject areas where new research is constantly expanding a field of knowledge, currency is very important.
Look for dates on a page. Be careful: automatic date scripts can be included on a page so that it appears current when in fact it may be quite dated. Many authors include a “dateline” or “updated” field somewhere on the page.
It’s also important to distinguish between the date in search results and the date a document was actually published. Some search engines include a date next to each result. These dates often have nothing to do with the document itself‑rather, they are the date the search engine’s crawler last spidered the page. While this can give you a good idea of the freshness of a search engine’s database, it can be misleading to assume that the document’s creation date is the same. Always check the document itself if the date is an important part of your evaluation criteria.
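One way to act on this advice is to look for the page's own dateline rather than trusting the date a search engine reports. The sketch below uses a simple regular expression to find an "updated" or "last modified" line in page text; the HTML snippet and the date formats it handles are invented for illustration, since real pages vary widely and no single pattern catches them all.

```python
import re
from datetime import datetime

# Heuristic: find a "dateline" or "updated" field in page text so it
# can be compared with the crawl date a search engine shows. The
# pattern below is deliberately narrow and only illustrative.
DATE_PATTERN = re.compile(
    r"(?:updated|last modified)[:\s]+(\d{1,2} \w+ \d{4})",
    re.IGNORECASE,
)

def find_dateline(page_text):
    """Return the page's stated date, or None if no dateline is found."""
    match = DATE_PATTERN.search(page_text)
    if not match:
        return None
    return datetime.strptime(match.group(1), "%d %B %Y").date()

page = "<p>Last modified: 4 March 1999</p>"
published = find_dateline(page)
# A crawler may have fetched this page yesterday, but the document
# itself dates from 1999: prefer the page's own dateline.
```

If no dateline is found, the function returns None, which is itself useful information: a page that states no date at all deserves extra skepticism when currency matters.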
Use Common Sense. Apply the same filters to the Web as you do to other sources of information in your life. Ask yourself: “How would I respond to this if I were reading it in a newspaper, or in a piece of junk mail?” Just because something is on the Web doesn’t mean you should believe it‑quite the contrary, in many cases.
For excellent guidance on evaluating the quality of Web resources, I recommend Genie Tyburski’s Evaluating The Quality Of Information On The Internet at http://www.virtualchase.com/quality/index.html.
Keeping Current with the Invisible Web
Just as with the visible Web, new Invisible Web resources are being made available all the time. How do you keep up with potentially useful new additions? One way is to subscribe to the “Invisible Web Newsletter” published by the authors. Visit the companion site to this book for subscription details.
There are also several useful, high-quality current awareness services that publish newsletters covering Invisible Web resources. These newsletters don’t limit themselves to the Invisible Web, but the news and information they provide is exceptionally useful for all serious Web searchers. All of these newsletters are free.
The Scout Report
The Scout Report provides the closest thing to an “official” seal of approval for quality Web sites. Published weekly, it provides organized summaries of the most valuable and authoritative Web resources available. The Scout Report Signpost provides full-text search of nearly 6,000 of these summaries. The Scout Report staff is made up of librarians and information professionals, and their standards for inclusion in the report are quite high.
Librarians’ Index to the Internet (LII)
This searchable, annotated directory of Web resources, maintained by Carole Leita and a volunteer team of more than 70 reference librarians, is organized into categories including “best of,” “directories,” “databases,” and “specific resources.” Most of the Invisible Web content reviewed by LII falls in the “databases” and “specific resources” categories. Each entry also includes linked cross-references, making it a browser’s delight.
Leita also publishes a weekly newsletter that includes 15-20 of the resources added to the Web site during the previous week.
ResearchBuzz
ResearchBuzz is designed to cover the world of Internet research. To that end, this site provides almost daily updates on search engines, new data-managing software, browser technology, large compendiums of information, Web directories, and Invisible Web databases. If in doubt, the final question is “Would a reference librarian find it useful?” If the answer’s yes, in it goes.
ResearchBuzz’s creator, Tara Calishain, is the author of numerous Internet research books, including Official Netscape Guide to Internet Research. Unlike most of the other current awareness services described here, Calishain often writes in-depth reviews and analyses of new resources, pointing out both useful features and flaws in design or implementation.
Free Pint
Free Pint is an e-mail newsletter dedicated to helping you find reliable Web sites and search the Web more effectively. It’s written by and for knowledge workers who can’t afford to spend valuable time sifting through junk on the Web in search of a few valuable nuggets of e-gold. Each issue of Free Pint has several regular sections. William Hann, Managing Editor, leads off with an overview of the issue and general news announcements, followed by a “Tips and Techniques” section where professionals share their best searching tips and describe their favorite Web sites.
The feature article covers a specific topic in detail. Recent articles have been devoted to competitive intelligence on the Internet, central and eastern European Web sources, chemistry resources, Web sites for senior citizens, and a wide range of other topics. Feature articles run between 1,000 and 2,000 words, and are packed with useful background information, in addition to numerous annotated links to vetted sites in the article’s subject area. Quite often these are Invisible Web resources. One nice aspect of Free Pint is that it often focuses on European resources that aren’t always well known in North America or other parts of the world.
Internet Resources Newsletter
Internet Resources Newsletter’s mission is to raise awareness of new sources of information on the Internet, particularly for academics, students, engineers, scientists, and social scientists. Published monthly, it is edited by Heriot-Watt University Library staff and published by the Heriot-Watt University Internet Resource Centre.