(Music)
21CIF Team : Welcome to the Full Circle Resource Kit Podcast. In today's interview we are speaking about searching the Deep Web with Laura Cohen, Library Web Administrator at the State University of New York in Albany.
You wrote a fascinating article about what has been called the Invisible Web and you made a case for why it should be called something else. Just tell us a little bit about what your thinking was about that term, the Invisible Web and what you are calling it now.
Laura: Yes, I've been advocating for calling the Invisible Web the Deep Web, as opposed to the Surface Web of static web pages. Static web pages are those that are written in .htm or .html and so on. They are just basic web pages that are marked up with HTML. The Deep Web is that part of the Web that is not contained in static web pages; you'll find it in databases, and you'll find it in non-print or non-textual resources such as multimedia, software, and so on.
The issue with the Invisible Web is that no recorded information, unless you have recorded it with invisible ink, is invisible. I think it is really an insult to say that just because there is a certain type of information out there that you cannot necessarily find in general search engines, it is therefore invisible. The problem is, that's very search engine-centric, or at least it's very general search engine-centric. By general search engine-centric I'm talking about the general search engines out there that we all know and pretty much love - Google, Yahoo, AltaVista, Clusty and so many of the other ones. These search engines can pretty much only pick up on the static, surface web. But that's OK, there are many other kinds of search engines that are coming along now that will allow you to get at this other sort of content.
21CIF Team : So, just the fact that a search engine that many people would use like Google or Yahoo can't find a particular document because of the way that the crawlers work and the robots and things that are out indexing those pages doesn't mean that it is not findable. And that's really where you are making the case for calling it something other than the Invisible Web.
Laura: That's absolutely correct. And as an academic librarian, I also think about some of the newer search tools that are coming out that are so important to research, such as Google Book Search, Google Scholar, Amazon Search Inside the Book, and so on. I mean, these tools alone have opened up the world of research so much. You won't find the content from these tools on general search tools. You need to go into the specialty databases.
Another thing that strikes me, too, is the fact that we have never called our own library catalogues and our licensed databases the Invisible Library. These are just specialized finding tools that we use, and need to use, to get at library content. So, if there is no Invisible Library there is no Invisible Web.
21CIF Team : So maybe the fact that we even use the term, Invisible Web, just tells us something about how much search engines have become a part of our way of looking at research, rather than where the information can actually be found.
Laura: I think that is very true, and I think that it's a cautionary tale for librarians to be sure that we continue to think for ourselves.
21CIF Team : Excellent point. What are some of the challenges that you have found with your students and faculty in helping them discover what is in the Deep Web?
Laura: Well, I give them a series of choices and I always begin with, "What is your query? What is it that you are looking for?" There is no 'one size fits all' kind of a strategy. The number one thing that we always need to look at is, "What is a person's query? What are you looking for?" Once you understand that, then you can begin to provide advice. And depending on the query, I have basically three strategies that I give people.
The first one is to tell folks that you can find some limited Deep Web information and content in general search engines. I think we are all familiar with being able to go to something like Google or AltaVista and so on, and looking at different tabs for news, images, video - that kind of thing. You also will find some Deep Web content integrated into search results, such as Google returning PDFs and Word documents. With specialty search syntax you can look up maps, phone numbers, flights - that kind of thing. Also, you are now beginning to see some small amount of databased content appearing in the results of general search engines, such as Yahoo Directory topic pages. SurfWax is a search engine, for example, that integrates Yahoo News into its general results list. So, that's my first piece of advice: in a very limited way you can get some Deep Web content in search engines.
But to get at much more broad-based and specialty kinds of content there are two other pieces of advice that I give. One is to look for specialty search tools. For example, and this is a really obvious case, findsounds.com. I think you can figure out what that's all about. It's just a wonderful search tool for finding sounds. Technorati, that's a blog search engine. And again we get back to the idea of calling the Deep Web the Invisible Web. Are you suddenly going to say that the content of blogs is invisible? I don't think so! RSS feeds - that's a major new way of getting at content. There are a number of RSS search engines out there. One that I really love is SurfWax LookAhead, in which you can type a query and as you are typing you actually begin to pull up RSS feeds that contain your search terms. And then again, just to bring up Google Book Search, Google Scholar, Amazon Search Inside the Book and so on.
21CIF Team : What about the databases that many organizations and groups with extensive web sites create, the ones you can only access once you get to their homepage? That seems to be another great source of content.
Laura: That's my third piece of advice. So you have anticipated me beautifully. I call this "split-level searching". It's a two-step process to get at this content. You can't, of course, predict what's going to be in a database. You just don't know, because the technology of the web is expanding so much, and in order to maintain large and varied websites a lot of this content is being put into databases. So the first thing you need to do is to go to a general search engine and do a topical search, and look through your results list to see whether there is a site on the list containing a database that you can then search. The second level is to go to the site that contains the database and search it.
To give you one example, let's say you want to explore the plants of North America. You could go to Google, or Yahoo or any of the other major search tools and search for the phrase "North American plants". And what you will probably find in most of those search engines is what's known as the Plants Database from the US Department of Agriculture. Depending on your search, it will probably appear on your first screen of results. So the second level of your split-level searching is to go to the Plants Database site and to search the database that is contained within that site.
Dan: One of the things that we have actually done is to recommend that people use the term database or archive or catalogue as part of their query along with that topic.
Laura: Directory is another good term to throw into a search like that. We're of like minds because I have the same technique that I use with students and it's a very successful one.
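[Editor's illustration] The technique discussed above, adding a term such as database, archive, catalogue or directory to a topical query, can be sketched in a few lines of Python. This is only a minimal sketch of the first level of split-level searching. The function name build_queries and the constant COLLECTION_TERMS are hypothetical names chosen for this example, and quoting the topic as an exact phrase is an assumption, not something the speakers prescribe.

```python
from typing import List, Sequence

# Terms suggested in the interview; treat the list as a starting point.
COLLECTION_TERMS = ("database", "archive", "catalogue", "directory")


def build_queries(topic: str, terms: Sequence[str] = COLLECTION_TERMS) -> List[str]:
    """Return one query string per collection term for the given topic.

    The topic is wrapped in double quotes so a search engine that supports
    phrase search treats it as a phrase (an assumption of this sketch).
    """
    # Collapse runs of whitespace and trim the ends.
    topic = " ".join(topic.split())
    if not topic:
        return []
    return [f'"{topic}" {term}' for term in terms]


if __name__ == "__main__":
    # Paste each line into a general search engine, then look through the
    # results for a site that hosts a searchable database (the second level).
    for query in build_queries("North American plants"):
        print(query)
```

Each generated query is meant to be tried in a general search engine. Searching the database on whichever site turns up remains a manual second step.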
21CIF Team : Thank you so much.
Laura: You're welcome. I appreciate the invitation. This was fun and very interesting and you are doing great work.
21CIF Team: This is a production of the 21st Century Information Fluency Project at the Illinois Math and Science Academy.
(Music)