SILS, U. of North Carolina, Chapel Hill
INLS-509
(Old INLS-172) -- Information Retrieval
Bob Losee
Manning 302
962-7150
losee at unc dot edu
Fall 2006
Brief Description:
An introductory survey of information filtering and retrieval, with an emphasis on
developing the student's understanding of the relationship between the
algorithms used by search engines, the query and document, and system
performance. This is an information
science course, not an information technology course. The course is required for students in the
School’s Master’s in Information Science program and will emphasize basic
knowledge useful for those who will be in leadership positions in the
information professions.
Course WWW links:
http://InformationRetrieval.US
(if you forget, there is a link from my home page)
Course Outline
Readings below are required except for those preceded by an asterisk
(*). Note that students are never expected to absorb all the
material or understand all the mathematics in the articles.
Introduction:
Retrieval and Filtering
Losee, Lecture Notes (available in bookstore), Chapter 1.
Sparck-Jones and Willett, Readings in Information Retrieval ("RIR" below), Morgan Kaufmann
Publishers, 1997. Chapter 1.
* Baeza-Yates and Ribeiro-Neto, Chapters 4, 10
* Case, Donald, Looking for Information: A Survey of Research
on Information Seeking, Needs, and Behavior, Academic Press, 2002.
* Sugar, “User-centered Perspectives of Information Retrieval
Research and Analysis Methods,” Annual
Review of Information Science and Technology, 1995, 77-109.
Probability
Losee, Lecture Notes,
Chapter 2.
Students may wish to consult one or more of the "management
science" books in the UNC libraries.
Indexing, Document, and
Media Representation
Losee, Lecture Notes, Chapter 3
RIR, Chapter 2, articles by Joyce and Needham (p. 15); Luhn (p.
21); Doyle (p. 25); Cleverdon (p. 47); Salton and Lesk (p. 60.)
* Iivonen and Sonnenwald, “From Translation to Navigation of
Different Discourses: a Model of Search Term Selection during the Pre-online
Stage of the Search Process,” Journal of the American Society for
Information Science, 49 (Apr.
1 '98), 312-26.
* Svenonius, "Access to Nonbook Materials: The Limits of
Subject Indexing for Visual and Aural Languages," Journal of the American Society for Information Science, 45(8) Sept. 94, 600-606.
* Salton and McGill, Introduction
to Modern Information Retrieval, McGraw-Hill, 1983, Chapter 3.
* Salton, Automatic Text
Processing, Addison-Wesley, 1989, Chapter 9.
Retrieval Performance
RIR, Chapter 3, article by Saracevic (p. 143.)
RIR, Chapter 4, articles by Saracevic, Kantor, Chamis, and
Trivison (p. 175); Cooper (p. 191); Tague-Sutcliffe (p. 205); Keen (p. 217.)
* Baeza-Yates and Ribeiro-Neto, Chapter 3.
Losee, Lecture Notes, Chapter
4.
* Losee, Lecture Notes, Chapter
6.
* Van Rijsbergen, Information
Retrieval, 2nd ed., Butterworths, 1979, Chapter 7.
Similarity and Retrieval Decisions
RIR, Chapter 5, articles by Cooper (p. 265); Belkin, Oddy, and Brooks (p. 299.)
RIR, Chapter 6, articles
by Salton and Buckley (p. 355); Croft and Harper (p. 339.)
RIR, Chapter 7, article
by Tenopir and Cahn (p. 446.)
Losee, Lecture Notes, Chapter 5
* Van Rijsbergen,
Chapters 5 & 6.
Relationships between
Terms, Natural Language Processing
Losee, Lecture Notes, Chapters 8, 9, and 11.
RIR, Chapter 5, article by Turtle and Croft (p. 287.)
RIR, Chapter 6, article by Porter (p. 313.)
RIR, Chapter 8, articles by Salton, Allan, Buckley and Singhal
(p. 478); Rau (p. 527); Johnson, Paice, Black, and Neal (p. 538.)
* Chowdhury, “Natural Language Processing,” in Annual Review of
Information Science and Technology, 2003.
Rule Based and Logical Systems
Losee, Lecture Notes, Chapter 10.
* Forsyth and Rada, Machine
Learning: Applications in Expert Systems and Information Retrieval, Wiley,
1986, Chapters 6-14.
Coding and Compression
* Salton, 1989, Chapters
5 & 6.
* Losee, Science of Information, 1990, Chapter 2.
Course Evaluation:
Quality of class participation 40%
Critiques of readings 30%
Other homework 30%
Critiques of Readings:
For some
articles listed on the course schedule, students are expected to write a critique
of the article of 5 to 8 sentences in length (maximum ¾ page single spaced, 1
page double spaced) and hand in the critique (on paper, not via email, and use
serif fonts for the body of the text) by the beginning of the class on the due
date listed on the schedule. The critiques should be constructive and might include (1) questions that arose as you read
the article and whose answers would be useful, (2) suggestions for improving the
research described in the article, (3) ideas about additional research that
might be conducted in this area, and (4) possible research questions that could
be turned into (and are focused enough and small enough to be) SILS Master’s
papers, along with methodologies for addressing these questions. Do not criticize the author’s writing style
or the choice of topic. The single lowest
critique grade will be dropped to cover “bad days,” critiques that don’t get
handed in on time, or sickness.
Information Retrieval Leadership Proposals:
Each student will develop three Information Retrieval Leadership
Proposals. The Leadership Proposal areas
(due dates for printed proposals are on the class schedule) are
Proposal 1: Expressions of
information needs as queries by individuals or groups; query languages; means
for eliciting information needs.
Proposal 2: Univariate (statistically
independent) feature, document, and query matching and similarities, assuming
term independence; indexing (as viewed from retrieval).
Proposal 3: Multivariate similarity or
matching systems; multivariate reasoning systems; natural language processing.
The proposals are due at
the start of class on the day indicated.
Each proposal should be a total of 2 to 4 pages, single spaced. Do not
use a sans serif font; these fonts (e.g. Helvetica or Arial) are designed for
headlines and captions, not the body of text in a paper. As the title for each paper, state clearly
what question you are asking, formulated as an English language question
with a question mark at the end. The
proposal should address the nature of the problem, a discussion of how results
and theory in the literature "support" the problem, methodology, the
kinds of results you expect to find, and the usefulness of the answer to your
question. The question and its answer should address issues bigger than found
at one site or one system or one language; the most useful questions are
generic questions that are of the form “is X better than Y?” Select a question whose answer would make you
a leader in IR by suggesting ways people should make decisions differently or
better. Descriptive studies are
acceptable but always considered less useful than constructive studies that
make concrete recommendations. The focus
of each proposal needs to be on a question closely related to the topic for the
date, with other information retrieval system considerations being
secondary. Grading will be based upon
how well the proposal addresses the question related to the topic, the
usefulness of the proposed analysis, how feasible answering the question is as
a student 3-credit project or master's paper, and the quality of the proposed
methodology for answering the question.
Proposing a small project that leads to definite knowledge and possible
improvement of practice is always better than a larger project that just
amasses data but doesn’t lead to much understanding or the improvement of
practice.
For the first proposal,
your question should not discuss or evaluate a particular information system or
information resource, or the use of a system or systems by users. Propose a study of information needs independent of how the need might be satisfied or
how searching for an answer takes place.
You may look at information use, but only as a way to study the focus of
this proposal, information need. You
might want to think about psychological studies of individuals, to learn how
needs are formulated, felt, or expressed, or you might wish to focus on a
particular functional group and their particularly different needs or
expressions of needs. If you start
writing about how a system serves people or how people search for information,
stop.
For the second proposal, your question should address matters
associated with individual terms, either in the area of indexing or
retrieval. You can address multiple term
systems; however, the terms should be treated as independent of each other (as
do most of the retrieval models discussed up to this point in the course).
For the third proposal, your question should explicitly address
systems using the relationships that exist between document features and
consider how this would impact retrieval performance. Methods of looking at these relationships
might include statistical dependencies, multivariate machine learning
techniques, linguistic (syntactic or semantic) information, or a logical system
based on a thesaurus.
Warning: Don’t write on a topic. You should be writing to show how the
methodology will answer the question you provide. If your methodology won’t provide a
definitive (or at least solid) answer to the question, the question may be too
broad and might be narrowed further.
Doing a good job on a professionally relevant but narrow question is always
better than a much weaker answer to a broader question. Each question-answer combination should show
how to lead the field of information retrieval.
Each student is expected to conduct a
small research project and write up the project in a paper of 4 to 10 pages of
text, single spaced, to be handed in on paper.
You may use any widely accepted paper style (e.g., Chicago, APA,
MLA). The project should begin with a
question whose answer would be of value to the information retrieval
community. The question is best phrased
in the form “Is X better than Y for Z?” rather than “How and why does Z
work?” or “How does X impact Z?” There should be a brief discussion of the
literature addressing areas around the question, possibly citing 3 to 6 related
articles. The question should be clearly
stated in the paper and the paper should focus on answering this question by
drawing conclusions based primarily on the data collected and analyzed. The research should involve either the manual
or automated analysis of data to be gathered by the student (not from the
literature), and it may be either quantitative or qualitative. Studies must focus on more than one system
(or multiple distributed systems) or more than one user; the focus should be on
knowledge and techniques applicable to a wide range of systems and/or
users. Do not base your data analysis
primarily on published data. Implementing a system or software, or planning to
implement such a system, is not acceptable as the course project; you may wish
to perform a study to gain knowledge that might help outside the course to
develop a system, or you might use software you have developed to test out a
hypothesis. The paper should describe
and analyze the results, with an emphasis on interpretation (“why”) leading to
an understanding of the results. Insight
into the strengths and weaknesses of the different techniques or situations is
more important than raw performance improvement. The last paragraph of the paper should
contain specific recommendations for professional practice, as well as
summaries of the reasons for these recommendations.
Criteria for Leadership
Proposals (and Class Participation) Evaluation
This is a required course
for the SILS Master’s degree in Information Science. You are here to learn, not to worry. Anyone who puts in a reasonable effort should
expect to pass the course.
An H paper
includes a question whose answer will improve the operation of more than one
information retrieval system. The paper
should include strong reasons for considering the problem important to ILS
professionals, a brief literature review, and a methods section, as well as a
clear explanation or argument about why these results occurred. The question to be answered should be
topically similar to those questions addressed in journals such as JASIS
and IP&M. An H course
grade indicates clear excellence and leadership in the course.
A P paper is a
good solid piece of work, at the normal graduate level, that may be less
effective in explaining why the question’s answer would be useful or in
connecting it to central issues in the field; or it may lack references to
relevant literature; or it may lack an obvious connection between the question
and the methods to be used; or it may not describe the question or the methodology
precisely; or it may overlook some minor methodological problems or fail to
discuss or resolve them satisfactorily.
There may be little explanation about why these particular results
occurred. P is the most commonly
awarded course grade in graduate level courses such as this.
An L paper may
fail to explain the utility of the research or it may fail to connect the
question to the methods to be used or the different aspects of methods to each
other. Major methodological problems may
have been overlooked. There may be
little or no understanding provided as to the cause of the results.
An F paper is
lacking a required element (the question, relevant literature, research site
and/or sources and/or subjects, data collection and analysis). Any plagiarism or other violation of the
Honor Code will also result in an F and the likelihood of further action.
Each student will develop three informal IR Leadership
proposals. The Leadership proposal
areas and due dates (late proposals penalized!) are
Wed. Oct. 11 Individual
users' information needs, expressions of needs as queries.
Wed. Nov. 8 Univariate feature matching
and term independence, indexing.
Wed. Dec. 13 Multivariate systems,
reasoning systems, natural language processing.
The first 2 proposals are due at the start of class on the day
indicated, and the last proposal is due at noon. Each proposal
should be a total of 2 to 4 pages, single spaced. State clearly what question you are asking,
formulated as an English language question with a question mark at the
end. The proposal should address the
nature of the problem, a discussion of how results and theory in the literature
"support" the problem, methodology, the kinds of results you expect
to find, and the importance of your question and approach. The focus of each proposal needs to be on a
question closely related to the topic for the date, with other information
retrieval system considerations being secondary. Grading will be based upon how well the
proposal addresses the question related to the topic, the usefulness of the
proposed research, its feasibility as a student 3 credit project or master's
paper, and the quality of the proposed methodology. Proposing a small project that leads to
definite knowledge and possible improvement of practice is always better than a
larger project which just amasses data but doesn’t lead to much understanding
or the improvement of practice.
Sources of Information on Information Filtering
& Retrieval
Serials:
The major
serials covering IR include Information
Processing and Management (formerly Information
Storage and Retrieval), Journal of
the American Society for Information Science and Technology (formerly JASIS
and before that American Documentation),
Journal of Documentation, IEEE Trans on
Pattern Analysis and Machine Intelligence, IEEE Trans on Knowledge and Data
Engineering, ACM Transactions on
Information Systems, Information Retrieval, and New Review of Document & Text Management.
Conference Proceedings:
The ACM Special Interest Group on Information Retrieval (SIGIR)
has held annual conferences since 1980.
The conference is usually held in North America
in odd years and outside North America in even
years. Some European conferences have
been published as "books."
Most of the ACM SIGIR conference proceedings are in the ACM Digital
Library and can be accessed through the library web page.
Monographs:
Baldi and
Brunak, Bioinformatics: The Machine
Learning Approach, MIT, 2001.
Baldi,
Frasconi, and Smyth, Modeling the Internet and the Web, Wiley, 2003.
Baeza-Yates
and Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.
Case,
Donald, Looking for Information: A Survey of Research on Information
Seeking, Needs, and Behavior, Academic Press, 2002.
Chen, Li,
and Wang, Machine Learning and
Statistical Modeling Approaches to Image Retrieval, Kluwer, 2004.
Chu, Heting, Information
Representation and Retrieval in the Digital Age, ASIS, 2003.
Foskett, A.
C., The Subject Approach to Information, London, Lib. Assoc. Publ, 1996.
Forsyth and
Rada, Machine Learning: Applications in
Expert Systems and Information Retrieval, Wiley, 1986.
Frakes and Baeza-Yates, eds., Information Retrieval: Data Structures
& Algorithms, Prentice Hall, 1992.
Frants,
Shapiro, and Voiskunskii, Automated
Information Retrieval, Academic Press, 1997.
Grossman and
Frieder, Information Retrieval:
Algorithms and Heuristics, Second edition, Springer-Verlag, 2004.
Grefenstette,
Cross-Language Information Retrieval, Kluwer, 1998.
Korfhage, Information Storage and Retrieval,
Wiley, 1997.
Kowalski and
Maybury, Information Storage and
Retrieval Systems, Kluwer, 2000.
Langville
and Meyer, Google’s PageRank and Beyond:
The Science of Search Engine Rankings, Princeton,
2006.
Losee, Text Retrieval and Filtering, Kluwer,
1998.
Maybury, M.,
Ed., Intelligent Multimedia Information
Retrieval, AAAI/MIT Press, 1997.
Salton, Automatic Text Processing,
Addison-Wesley, 1989.
Salton and
McGill, Introduction to Modern
Information Retrieval, McGraw Hill, 1983.
Sparck Jones
and Willett, Readings in Information Retrieval,
Morgan Kaufmann Publishers, 1997.
Van
Rijsbergen, Geometry of Information
Retrieval, Cambridge,
2004.
Van
Rijsbergen, Information Retrieval, Second
Edition, Butterworth, 1979.
Wu, Xiong,
and Shekhar, Clustering and Information
Retrieval, Kluwer, 2004.
Honor Code:
Students
should familiarize themselves with the University of North
Carolina at Chapel Hill Honor Code that is
described in University publications. It
should be noted that in this course, students are expected to receive (and
provide) some assistance regarding the use of hardware and software in the
laboratories and general problem solving techniques for homework assignments. Students should NOT receive (or provide)
major creative assistance or continuing minor support for projects.
Plagiarism:
Student
assignments containing more than 5 consecutive words that
the instructor feels were taken from another source without proper attribution
(without the proper quote marks and citations) definitely will be referred
to the appropriate administrative authorities who address issues of Academic
Integrity (e.g. the Honor Court).
I assume that all students are equally likely to be honest, and I will put
an equal amount of effort into considering the possibility of plagiarism in
each student’s paper.
Classroom Behavior:
Separate
from the Honor Code but related to respect for classmates is classroom
behavior, which will be a factor in your class participation grade. Students are expected to behave in a
professional manner in class. Students
in class are expected to focus on classroom materials. Students are expected to avoid
student-to-student conversations during class.
Use of laptop computers should be limited to taking notes for class
and to using class related materials.
Similarly, materials being read should be limited to those appropriate
for the classroom lecture or discussion.
Students who appear to be involved in non-class related activities
during class time will be graded as not participating in class. Cellular telephones and computers should have
speakers or other audio devices muted before class begins so as to not disturb
others.