SILS, U. of North Carolina, Chapel Hill
INLS-509 -- Information Retrieval
Bob Losee
Manning 302
962-7150
losee at unc dot edu
Spring 2008
Brief Description:
An introductory survey of information filtering and retrieval, with an emphasis on
developing the student's understanding of the relationship between the
algorithms used by search engines, the query and document, and system
performance. This is an information science course, not an information
technology course. The course is required for students in the School’s
Master’s in Information Science program and will emphasize basic knowledge
useful for those who will be in leadership positions in a wide range of
information professions.
Course WWW links:
http://InformationRetrieval.US
(if you forget, there is a link from my home page)
Course Outline
Readings below are required except for those preceded by
an asterisk (*). Note that students are never expected to absorb all the
material or understand all the mathematics in the articles.
Introduction: Retrieval and Filtering
Losee, Lecture Notes (available in bookstore), Chapter 1.
Sparck-Jones and Willett, Readings in Information
Retrieval ("RIR" below), Morgan Kaufmann Publishers,
1997. Chapter 1.
* Baeza-Yates and Ribeiro-Neto, Chapters 4, 10
* Case, Donald, Looking for Information: A Survey of
Research on Information Seeking, Needs, and Behavior, Academic Press,
2002.
* Sugar, “User-centered Perspectives of Information
Retrieval Research and Analysis Methods,” Annual Review of Information
Science and Technology, 1995, 77-109.
Probability
Losee, Lecture Notes, Chapter 2.
Students may wish to consult one or more of the
"management science" books in the UNC libraries.
Indexing, Document, and Media Representation
Losee, Lecture Notes, Chapter 3
RIR, Chapter 2, articles by Joyce and Needham (p. 15);
Luhn (p. 21); Doyle (p. 25); Cleverdon (p. 47); Salton and Lesk (p. 60.)
* Iivonen and Sonnenwald, “From Translation to Navigation
of Different Discourses: a Model of Search Term Selection during the Pre-online
Stage of the Search Process,” Journal of the American Society for
Information Science, 49 (Apr. 1 '98), 312-26.
* Svenonius, "Access to Nonbook Materials: The Limits
of Subject Indexing for Visual and Aural Languages," Journal of the
American Society for Information Science, 45(8) Sept. 94, 600-606.
* Salton and McGill, Introduction to Modern Information
Retrieval, McGraw-Hill, 1983, Chapter 3.
* Salton, Automatic Text Processing,
Addison-Wesley, 1989, Chapter 9.
Retrieval Performance
RIR, Chapter 3, article by Saracevic (p. 143.)
RIR, Chapter 4, articles by Saracevic, Kantor, Chamis, and
Trivison (p. 175); Cooper (p. 191); Tague-Sutcliffe (p. 205); Keen (p. 217.)
* Baeza-Yates and Ribeiro-Neto, Chapter 3.
Losee, Lecture Notes, Chapter 4.
* Losee, Lecture Notes, Chapter 6.
* Van Rijsbergen, Information Retrieval, 2nd ed.,
Butterworths, 1979, Chapter 7.
Similarity and Retrieval Decisions
RIR, Chapter 5, articles by Cooper (p. 265); Belkin, Oddy,
and Brooks (p. 299.)
RIR, Chapter 6, articles by Salton and Buckley
(p. 355); Croft and Harper (p. 339.)
RIR, Chapter 7, article by Tenopir and Cahn
(p. 446.)
Losee, Lecture Notes, Chapter 5
* Van Rijsbergen, Chapters 5 & 6.
Relationships between Terms, Natural Language Processing
Losee, Lecture Notes, Chapters 8, 9, 11.
RIR, Chapter 5, article by Turtle and Croft (p. 287.)
RIR, Chapter 6, article by Porter (p. 313.)
RIR, Chapter 8, articles by Salton, Allan, Buckley and
Singhal (p. 478); Rau (p. 527); Johnson, Paice, Black, and Neal (p. 538.)
* Chowdhury, “Natural Language Processing,” in Annual
Review of Information Science and Technology, 2003.
Rule Based and Logical Systems
Losee, Lecture Notes, Chapter 10.
* Forsyth and Rada, Machine Learning: Applications in
Expert Systems and Information Retrieval, Wiley, 1986, Chapters 6-14.
Coding and Compression
* Salton, 1989, Chapters 5 & 6.
* Losee, Science of Information, 1990,
Chapter 2.
Course Evaluation:
Quality of class participation 40%
Critiques of readings 30%
Other homework 30%
Critiques of Readings:
For some articles listed on the course schedule, students are expected to write a critique
of the article of 5 to 10 sentences in length (maximum ¾ page single spaced, 1
page double spaced) and hand in the critique (on paper, not via email, and use
serif fonts for the body of the text) by the beginning of the class on the due
date listed on the schedule.
The critiques should be constructive, emphasizing ways that the research
could be improved or expanded, and might include questions that arose as you
read the article whose answers would be useful, possible research questions
that could be turned into (and are focused enough and small enough to be) SILS
Master’s papers, along with methodologies for addressing these questions. Do
not criticize the author’s writing style or the choice of topic; emphasize how
you personally might be able to expand on the article. The one lowest critique
grade will be dropped, to cover “bad days,” critiques that don’t get handed in
on time, or sickness.
End of the Semester Proposal
By noon April 25, the Friday of the last week of class, students will hand in a printed research
proposal based upon one of the critiques they wrote during the semester. The
proposal should contain a clearly stated research hypothesis in the first
paragraph. This research proposal should be 3 to 6 pages, single spaced, and
should include enough discussion of the related literature to ground the
different aspects or components of your research hypothesis, and how it would be
answered, in the context of the research literature.
Information Retrieval Leadership Proposals:
Each student will develop three Information Retrieval
Leadership Proposals. The Leadership Proposal areas (due dates for printed
proposals are on the class schedule) are:
Proposal 1: Expressions of information needs as queries
by individuals or groups; query languages; means for eliciting information
needs.
Proposal 2: Univariate (statistically independent) feature, document, and
query matching and similarities, assuming term independence; indexing (as
viewed from retrieval).
Proposal 3: Multivariate similarity or matching systems; multivariate
reasoning systems; natural language processing.
The proposals are due at the start of class on the day
indicated. Each proposal should be a total of 2 to 4 pages, single spaced. Do
not use a sans serif font; these fonts (e.g. Helvetica or Arial) are designed
for headlines and captions, not the body of text in a paper. As the title for
each paper, state clearly what question you are asking, formulated as an
English language question with a question mark at the end. The proposal
should address the nature of the problem, a discussion of how results and
theory in the literature "support" the problem, methodology, the
kinds of results you expect to find, and the usefulness of the answer to your
question. The question and its answer should address issues bigger than found
at one site or one system or one language; the most useful questions are
generic questions that are of the form “is X better than Y?” Select a question
whose answer would make you a leader in IR by suggesting ways people should
make decisions differently or better. Descriptive studies are acceptable but
always considered less useful than constructive studies that make concrete
recommendations. The focus of each proposal needs to be on a question closely
related to the topic for the date, with other information retrieval system
considerations being secondary. Grading will be based upon how well the
proposal addresses the question related to the topic, the usefulness of the
proposed analysis, how feasible answering the question is as a student 3-credit
project or master's paper, and the quality of the proposed methodology for
answering the question. Proposing a small project that leads to definite
knowledge and possible improvement of practice is always better than a larger
project which just amasses data but doesn’t lead to much understanding or
improvement of practice.
For the first proposal, your question should not discuss
or evaluate a particular information system or information resource, or the use
of a system or systems by users. Propose a study of information needs
independent of how the need might be satisfied or how searching for an answer
takes place. You may look at information use, but only as a way to study the
focus of this proposal, information need. You might want to think about
psychological studies of individuals, to learn how needs are formulated, felt,
or expressed, or you might wish to focus on a particular functional group and
their particularly different needs or expressions of needs. If you start
writing about how a system serves people or how people search for information,
stop.
For the second proposal, your question should address
matters associated with individual terms, either in the area of indexing or
retrieval. You can address multiple term systems; however, the terms should be
treated as independent of each other (as do most of the retrieval models
discussed up to this point in the course).
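As a hedged illustration of what "treated as independent of each other" can mean in practice, the Python sketch below scores a document by summing one contribution per query term, so no term's contribution depends on whether any other term is present. The function name `independent_score`, the toy weights, and the sample documents are assumptions made for illustration; they are not taken from the lecture notes or the readings.

```python
def independent_score(query_terms, doc_terms, weights):
    """Sum per-term weights for query terms that occur in the document.

    Each term contributes on its own; no pair of terms is examined
    together, which is the independence assumption in its simplest form.
    """
    doc = set(doc_terms)
    return sum(weights.get(t, 0.0) for t in set(query_terms) if t in doc)


# Hypothetical term weights (for example, rarer terms weighted more heavily).
weights = {"information": 0.5, "retrieval": 1.5, "filtering": 2.0}

# Hypothetical toy collection.
docs = {
    "d1": ["information", "retrieval", "systems"],
    "d2": ["information", "filtering"],
    "d3": ["library", "systems"],
}

query = ["information", "retrieval"]
scores = {name: independent_score(query, terms, weights)
          for name, terms in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A question for the second proposal might compare two ways of assigning the per-term weights while leaving this additive, term-by-term structure unchanged.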
For the third proposal, your question should explicitly
address systems using the relationships that exist between document features
and consider how this would impact retrieval performance. Methods of looking at
these relationships might include statistical dependencies, multivariate
machine learning techniques, linguistic (syntactic or semantic) information, or
a logical system based on a thesaurus.
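One way to make "statistical dependencies" between document features concrete is to compare how often two terms co-occur with how often they would co-occur if they were independent. The sketch below uses pointwise mutual information for this; the choice of measure, the function name `pmi`, and the toy collection are assumptions made for illustration, not something prescribed by the course.

```python
import math


def pmi(docs, term_a, term_b):
    """Pointwise mutual information of two terms over a document collection.

    Positive values mean the terms co-occur more often than independence
    would predict; zero means they look independent in this sample.
    Returns None when a count is zero and the estimate is undefined.
    """
    n = len(docs)
    has_a = sum(1 for d in docs if term_a in d)
    has_b = sum(1 for d in docs if term_b in d)
    has_both = sum(1 for d in docs if term_a in d and term_b in d)
    if n == 0 or has_a == 0 or has_b == 0 or has_both == 0:
        return None
    # log2( P(a, b) / (P(a) * P(b)) ), with probabilities estimated as counts / n
    return math.log2((has_both * n) / (has_a * has_b))


# Hypothetical toy collection; each document is a set of terms.
docs = [
    {"information", "retrieval"},
    {"information", "retrieval", "systems"},
    {"library", "systems"},
    {"library", "catalog"},
]

print(pmi(docs, "information", "retrieval"))
print(pmi(docs, "library", "systems"))
print(pmi(docs, "information", "catalog"))
```

A proposal in this area would go a step further and ask whether exploiting such a dependence measurably changes retrieval performance compared with a model that ignores it.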
Warning: Don’t write on a topic. You should be
writing to show how the methodology will answer the question you provide. If
your methodology won’t provide a definitive (or at least solid) answer to the
question, the question may be too broad and might be narrowed further. Doing a
good job on a professionally relevant but narrow question is always better than
a much weaker answer to a broader question. Each question-answer combination
should show how to lead the field of information retrieval.
Each student is expected to conduct
a small research project and write up the project in a paper of 4 to 10 pages
of text, single spaced, to be handed in on paper. You may use any widely
accepted paper style (e.g., Chicago, APA, MLA). The project should begin with
a question whose answer would be of value to the information retrieval
community. The question is best phrased in the form “Is X better than Y for
Z?” rather than “How and why does Z work?” or “How does X impact Z?” There
should be a brief discussion of the literature addressing areas around the
question, possibly citing 3 to 6 related articles. The question should be
clearly stated in the paper and the paper should focus on answering this
question by drawing conclusions based primarily on the data collected and
analyzed. The research should involve either the manual or automated analysis
of data to be gathered by the student (not from the literature), and it may be
either quantitative or qualitative. Studies must focus on more than one system
(or multiple distributed systems) or more than one user; the focus should be on
knowledge and techniques applicable to a wide range of systems and/or users.
Do not base your data analysis primarily on published data. Implementing a
system or software, or planning to implement such a system, is not acceptable
as the course project; you may wish to perform a study to gain knowledge that
might help outside the course to develop a system, or you might use software
you have developed to test out a hypothesis. The paper should describe and
analyze the results, with an emphasis on interpretation (“why”) leading to an
understanding of the results. Insight into the strengths and weaknesses of the
different techniques or situations is more important than raw performance
improvement. The last paragraph of the paper should contain specific
recommendations for professional practice, as well as summaries of the reasons
for these recommendations.
Criteria for Leadership Proposals (and Class Participation) Evaluation
This is a required course for the SILS Master’s degree in
Information Science. You are here to learn, not to worry. Anyone who puts in
a reasonable effort should expect to pass the course.
An H paper includes a question whose answer will improve
the operation of more than one information retrieval system. The paper should
include strong reasons for considering the problem important to ILS
professionals, a brief literature review, and a methods section, as well as a
clear explanation or argument about why these results occurred. The
question to be answered should be topically similar to those questions
addressed in journals such as JASIS and IP&M. An H
course grade indicates clear excellence and leadership in the course.
A P paper is a good solid piece of work, at the normal
graduate level, that may be less effective in explaining why the question’s
answer would be useful or in connecting it to central issues in the field; or
it may lack references to relevant literature; or it may lack an obvious
connection between the question and the methods to be used; or it may not describe
the question or the methodology precisely; or it may overlook some minor
methodological problems or fail to discuss or resolve them satisfactorily.
There may be little explanation about why these particular results occurred. P
is the most commonly awarded course grade in graduate level courses such as
this.
An L paper may fail to explain the utility of the research
or it may fail to connect the question to the methods to be used or the
different aspects of methods to each other. Major methodological problems may
have been overlooked. There may be little or no understanding provided as to
the cause of the results.
An F paper is lacking a required element (the question,
relevant literature, research site and/or sources and/or subjects, data
collection and analysis). Any plagiarism or other violation of the Honor Code
will also result in an F and the likelihood of further action.
Sources of Information on Information Filtering & Retrieval
Serials:
The major serials covering IR include Information Processing and Management
(formerly Information Storage and Retrieval), Journal of the American
Society for Information Science and Technology (formerly JASIS and
before that American Documentation), Journal of Documentation, IEEE
Trans. on Pattern Analysis and Machine Intelligence, IEEE Trans. on Knowledge
and Data Engineering, ACM Transactions on Information Systems, and Information
Retrieval.
Conference Proceedings:
The ACM Special Interest Group on Information Retrieval
(SIGIR) has held annual conferences since 1980. The conference is usually held
in North America in odd years and outside North America in even years. Some European
conferences have been published as "books." Most of the ACM SIGIR
conference proceedings are in the ACM Digital Library and can be accessed
through the library web page.
Monographs:
(** Best works or classics marked with asterisks)
Baldi and Brunak, Bioinformatics: The Machine Learning Approach, MIT, 2001.
Baldi, Frasconi, and Smyth, Modeling the Internet and the Web, Wiley, 2003.
Baeza-Yates and Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.
Case, Donald, Looking for Information: A Survey of Research on Information Seeking, Needs, and Behavior, Academic Press, 2002.
Chen, Li, and Wang, Machine Learning and Statistical Modeling Approaches to Image Retrieval, Kluwer, 2004.
Chu, Heting, Information Representation and Retrieval in the Digital Age, ASIS, 2003.
Feldman, R. and Sanger, J., The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge U. Press, 2006.
** Foskett, A. C., The Subject Approach to Information, London, Lib. Assoc. Publ, 1996.
Forsyth and Rada, Machine Learning: Applications in Expert Systems and Information Retrieval, Wiley, 1986.
** Frakes and Baeza-Yates, eds., Information Retrieval: Data Structures & Algorithms, Prentice Hall, 1992.
Frants, Shapiro, and Voiskunskii, Automated Information Retrieval, Academic Press, 1997.
Grossman and Frieder, Information Retrieval: Algorithms and Heuristics, Second edition, Springer-Verlag, 2004.
Grefenstette, Cross-Language Information Retrieval, Kluwer, 1998.
Korfhage, Information Storage and Retrieval, Wiley, 1997.
Kowalski and Maybury, Information Storage and Retrieval Systems, Kluwer, 2000.
Langville and Meyer, Google’s PageRank and Beyond: The Science of Search Engine Rankings, Princeton, 2006.
Losee, Text Retrieval and Filtering, Kluwer, 1998.
Manning, Raghavan, and Schütze, Introduction to Information Retrieval, Cambridge, 2008.
** Manning and Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
Maybury, M., Ed., Intelligent Multimedia Information Retrieval, AAAI/MIT Press, 1997.
Salton, Automatic Text Processing, Addison-Wesley, 1989.
** Salton and McGill, Introduction to Modern Information Retrieval, McGraw Hill, 1983.
Sparck Jones and Willett, Readings in Information Retrieval, Morgan Kaufmann Publishers, 1997.
Van Rijsbergen, Geometry of Information Retrieval, Cambridge, 2004.
** Van Rijsbergen, Information Retrieval, Second Edition, Butterworth, 1979.
Voorhees, E. and Harman, D., TREC: Experiment and Evaluation in Information Retrieval, MIT, 2005.
Wu, Xiong, and Shekhar, Clustering and Information Retrieval, Kluwer, 2004.
Honor Code:
Students should familiarize themselves with the University of North Carolina at Chapel
Hill Honor Code that is described in University publications. It should be
noted that in this course, students are expected to receive (and provide) some
assistance regarding the use of hardware and software in the laboratories and
general problem solving techniques for homework assignments. Students should
NOT receive (or provide) major creative assistance or continuing minor support
for projects.
Plagiarism:
Student assignments that are handed in that contain more than 5 consecutive words that
the instructor feels were taken from another source without proper attribution
(without the proper quote marks and citations) definitely will be referred
to the appropriate administrative authorities who address issues of Academic
Integrity (e.g. the Honor Court). I assume that all students are
equally likely to be honest and will put an equal amount of effort into
considering the possibility of plagiarism for each student’s paper.
Classroom Behavior:
Separate from the Honor Code but related to respect for classmates is classroom behavior,
which will be a factor in your class participation grade. Students are
expected to behave in a professional manner in class. Students in class are
expected to focus on classroom materials. Students are expected to avoid
student-to-student conversations during class. Use of laptop computers
should be limited to taking notes for class and to using class related
materials. Similarly, materials being read should be limited to those
appropriate for the classroom lecture or discussion. Students who appear to be
involved in non-class related activities during class time will be graded as
not participating in class. Cellular telephones and computers should have
speakers or other audio devices muted before class begins so as to not disturb
others.