NSF Government Information Workshop White Paper

Public Access and Use of Government Statistical Information

Gary Marchionini and Stephan Greene
University of Maryland at College Park

{march, stephang}@oriole.umd.edu

Federal Government Statistical Information

Government statistical information is essential to the day-to-day lives of all citizens. The importance of such data is illustrated by the efforts of multiple federal government agencies to create the National Statistical Information Infrastructure. Data from agencies such as the Bureau of Labor Statistics, the Census Bureau, and the Bureau of Economic Analysis determine costs of everything from apples to zinc, the locations of new businesses, and the indexes for all government programs and payments. Web-based technologies offer citizens broader access to the vast array of statistical data so that they may make better personal decisions (e.g., baby boomers planning for retirement, unemployed or underemployed workers looking to relocate, school children exploring careers). For broader segments of the citizenry to take advantage of government statistical information, however, the data must be both easy to find and easy to interpret and use. Sites that provide government statistics cannot assume that users access the data frequently enough to learn arcane codes and complex search strategies, nor that users have high levels of statistical literacy. Ease of search in this setting depends on helping users articulate needs (the traditional search interface challenge), on distributing these articulations to different datasets across the federal government and unifying the results (an interoperability challenge), and on presenting results in the forms most useful to users (a knowledge representation challenge). Ease of interpretation and use depends on appropriate representations (possibly customized to the users' needs) that include on-demand documentation for the statistics represented, and on techniques that help users avoid misusing data (e.g., dividing two indexes or ratios).

We view this as a human-system interface problem that has two levels of complexity. The first level is the interface the user sees and interacts with--as with all UIs, this must be clear, concise, and consistent, and give users opportunities to directly manipulate data objects in obvious and effective ways. Such interfaces must be rooted in theory and empirical evidence related to information seeking processes as applied to statistical data. Thus, the most critical need at this level is for research on users, specifically, how they think about statistical problems, and how they seek and interpret statistical data. Although we have long been actively engaged in developing general theories of information seeking, the specific characteristics of statistical data and the average citizen's statistical literacy demand investigation so that the first level interface challenge may be met--as illustrated by the following questions. How are needs for statistical data different from needs for linguistic data? How do novices think about statistical information needs and articulate those needs? How can exploratory data analysis techniques (e.g., Tukey, 1977) be enhanced in electronic environments?

The second level of interface is less apparent to the end user but essential to effective use of statistical data--this organizational level integrates metadata and data at all phases of interaction for three purposes: to provide instant documentary information to users on demand, to facilitate interoperability of search across distinct data sets, and to provide a basis for checking user operations for statistical legitimacy and alerting users to possible misuses. Thus, the most critical need at this level is for rich data models that support these requirements.

We believe that for government statistical repositories to be widely and accurately used, what is needed is a plan of research that investigates user information-seeking strategies in statistical databases, develops appropriate data models to support these strategies, and develops and tests prototype interfaces for government statistical data that unite the user needs and data model.

To illustrate one approach to this need, we describe a plan of work that combines three threads: extending our iterative, decision-theoretic framework for human information seeking (Marchionini, 1995) to statistical data accessed by typical citizens; building upon a backend database model for social science statistical data that promotes statistical metadata to a distinct metadatabase level (Greene, 1996); and applying our experience in user-centered interface design to unite user needs and rich data models in interface designs that employ dynamic query interface styles of rapid query refinement and result set visualization.

Thread 1. Information seeking is viewed as a dynamic and natural set of processes that ranges from analytical search where formal techniques are applied to extract relevant information (e.g., database queries, online searching by professional intermediaries, most of the current web-based search engine searches) to exploratory browsing techniques that are guided as much by the information landscape as the user need (e.g., most web surfing). We are currently engaged in investigating user information seeking behaviors in the Bureau of Labor Statistics (BLS) and Current Population Survey (CPS) web sites and the forthcoming One-Stop-Shopping (OSS) for federal statistics site. Preliminary results suggest that users need better tools for selective access (data extraction tools for microdata) and better front ends and descriptive data for macrodata such as time series for a host of variables. These requirements will become even more critical in the OSS site as users are referred to a variety of statistical sites in different agencies. Additionally, OSS demands interoperability at the interface as well as data structure levels. What is emerging is a user-task taxonomy for BLS and CPS sites that characterizes user types (e.g., journalist, economist, student, etc.) by need-based tasks (e.g., verify data, extract single value, determine trend, explore pattern, etc.). We view this as a more focused and elaborated taxonomy than the more general taxonomies previously proposed (e.g., Sundgren, 1996). What is urgently needed is extension of this taxonomy to the general public who will continue to develop web interests but may not have statistical backgrounds to either find or interpret statistical data. 
In addition to the investigations of the cognitive aspects of novice use of statistical data (e.g., What mental models do people have for surveys, indexes, time series, etc.?), there are many social and political implications to such access (e.g., a recent New York Times article on citizen access to political donations, the recurring debate on changing how the Consumer Price Index is computed, etc.), and it is essential that government agencies not only make statistical data available but also help citizens understand and ethically interpret these data.

Thread 2. The potential for misusing statistical data is well known (e.g., Greenstein, 1994) and better data models have been proposed as solutions (e.g., Michalewicz, 1991; Rafanelli, 1988). Statistical metadata representation as a metadatabase closely coupled to the primary data was part of Greene's Master's thesis at UMCP (Greene, 1996). The goal of this work is to support the dynamic manipulation of data by end users in a manner that preserves data integrity and generates new metadata that continues to provide description for measures derived under user control. This metadata representation is processed in tandem with the primary data and can be used to support interoperability and help avoid incorrect statistical manipulation. New levels of interactive data services can then be safely supported for novice users. It is useful to note that such a data model provides possibilities beyond a thesaurus that helps users pose queries to support alternative data sets or views for users (e.g., informing the user that the database does not have income data for Hispanics in Phoenix but can provide the data for Arizona). Such data models will then support interface widgets such as sliders for time or geographic maps to explore similar data. Additionally, such models will allow the interface to alert users to questionable data combinations (e.g., adding income in Detroit and wages in Pittsburgh). A fully realized system for dynamic metadata can make possible safe and meaningful dynamic data services for non-expert users of statistical information.
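To make the idea of metadata processed in tandem with primary data more concrete, the following is a minimal sketch and not the model from Greene (1996). It assumes a hypothetical `Measure` record whose fields (`concept`, `unit`, `kind`, `place`, `derivation`) and the `add` helper are invented here for illustration. The sketch refuses to add measures whose metadata disagree and, when an addition is allowed, generates new metadata describing the derived measure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    """A value bundled with the metadata needed to check operations on it.

    All field names are assumptions made for this sketch.
    """
    value: float
    concept: str  # what is measured, e.g. "income" or "wages"
    unit: str     # e.g. "USD"
    kind: str     # "amount", "index", or "ratio"
    place: str    # geographic coverage
    derivation: str = "published"


class QuestionableOperation(Exception):
    """Raised when the metadata indicate a statistically suspect operation."""


def add(a: Measure, b: Measure) -> Measure:
    """Add two measures only if their metadata say the sum is meaningful."""
    if a.kind in ("index", "ratio") or b.kind in ("index", "ratio"):
        raise QuestionableOperation("indexes and ratios are not additive")
    if a.concept != b.concept or a.unit != b.unit:
        raise QuestionableOperation(
            f"cannot add {a.concept} ({a.unit}) to {b.concept} ({b.unit})"
        )
    # The derived measure carries new metadata describing how it was produced.
    return Measure(
        value=a.value + b.value,
        concept=a.concept,
        unit=a.unit,
        kind=a.kind,
        place=f"{a.place} + {b.place}",
        derivation=f"sum of [{a.place}] and [{b.place}]",
    )
```

With this sketch, adding an income measure for Detroit to a wages measure for Pittsburgh raises `QuestionableOperation`, which an interface could surface as an alert, while adding two income measures in the same unit yields a new `Measure` whose `derivation` field documents the operation.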

Thread 3. The key to citizen access and use is clear and understandable interfaces. Results from thread one research should inform interface design that provides guidance for users and goes beyond retrieval to help them in understanding and interpreting statistics. We believe that such interfaces should provide guided tours, close coupling of queries and data through visualization techniques, and use of metadata to minimize misleading operations. Guided tours should be based on the user task taxonomy and may be instantiated as customized interface modalities. Dynamic queries (DQ) are an innovative visualization technique developed at the Human-Computer Interaction Laboratory at the University of Maryland to closely couple user interests and data through visualizations such as starfield or barfield displays (Shneiderman, 1994). DQs have been applied to medical history data (Plaisant et al., 1995), juvenile justice data (Greene & Rose, 1996), NASA datasets (Tanin et al., 1996), educational multimedia resources (Marchionini et al., in press), and to the data collections of the Library of Congress National Digital Library Program (Plaisant et al., 1996). Statistical data lends itself well to DQ solutions since the data elements are often at least intervally scaled and easily mapped to graphical displays and filtering widgets such as sliders. Interface designs that allow users to explore data rather than pose well-specified queries are appropriate alternatives for casual or novice users. Error checking (e.g., alerting users to illogical data unit combinations) and support for queries and interpretations (e.g., code dictionaries and concept thesauri) should be based on robust data models emerging from research on thread two.
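The coupling between a slider widget and an intervally scaled attribute can be sketched as follows. This is an illustrative assumption, not the HCIL implementation: the `RangeSlider` class, its method names, and the sample series (whose values are made up) are all hypothetical. The attribute is sorted once so that each slider adjustment is answered with two binary searches instead of a rescan of the data.

```python
from bisect import bisect_left, bisect_right


class RangeSlider:
    """Index one numeric attribute so slider moves can be answered quickly."""

    def __init__(self, records, attribute):
        # Sort once up front; later range selections only need binary search.
        self._sorted = sorted(records, key=lambda r: r[attribute])
        self._keys = [r[attribute] for r in self._sorted]

    def select(self, low, high):
        """Return the records whose attribute lies in the range [low, high]."""
        start = bisect_left(self._keys, low)
        stop = bisect_right(self._keys, high)
        return self._sorted[start:stop]


# Made-up time series used only to exercise the sketch.
series = [
    {"year": 1990, "rate": 5.0},
    {"year": 1991, "rate": 6.0},
    {"year": 1992, "rate": 7.0},
    {"year": 1993, "rate": 6.5},
    {"year": 1994, "rate": 6.0},
]

year_slider = RangeSlider(series, "year")
visible = year_slider.select(1992, 1994)  # the records a display would redraw
```

Each call to `select` stands in for one movement of the slider thumbs; a display would redraw only the returned records, which is what keeps query refinement rapid.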

Workshop questions:

1. The application domain is government statistical data as exemplified by the National Statistical Information Infrastructure.

2. There are several barriers that must be overcome for successful deployment. First, the difficulties related to understanding human cognitive activity through empirical methods will continue to challenge research on information seeking strategies. Determining how people think about problems that require statistical information and what mental models they have for fundamental methods and representations requires extensive empirical studies with a variety of individuals. This is time-consuming and highly subjective work. Second, robust data models depend on standards or on good mappings among different coding schemes. The coding schemes used for statistical data are specific to an agency and sometimes to a program within an agency. For example, instead of a single county code in Current Population Survey data, there are various codes that users must enter to include all counties since the coding scheme is based on legislation codes rather than Federal Information Processing Standards (FIPS) codes. Although there are efforts to update and standardize such codes, there is an enormous historical precedent and installed base of expertise that makes change difficult. A high priority activity will be to develop mapping schemes for simple units (e.g., geographic place ontology, monetary unit ontology, population ontology) and glossaries for more complex units (e.g., income, demographic categories). As these definitions may have political implications (e.g., Congressional districting, municipal bond ratings, government program qualifications, etc.), these mapping schemes and glossaries will require discussion and comment at multiple levels within agencies and across the government. Third, the sheer size of the government agencies that produce statistical data yields a coordination challenge. Even if all agencies are highly committed to collaboration, the many meetings, committees, and memoranda necessary for such a large-scale effort will take considerable time and effort.

3. There are three key technology enablers for such interfaces. First, faster web access is necessary for highly interactive dynamic query interfaces. Second, DQ technology must be scaled up to very large databases. Starfield and barfield displays are excellent for collections in the several thousand item range, but we must develop DQ techniques for data sets many orders of magnitude larger. We have had some success with hierarchical DQ displays that give orders of magnitude improvements (e.g., Tanin et al., 1996) and must develop additional techniques. Statistical data sets offer new and highly feasible opportunities for such development. Third, research in ontology development (e.g., Sowa, 1995) must be expanded to concepts such as measures and geopolitical boundaries.

4. We envision a user needs assessment effort that provides a basis for a prototype interface using data from two or three agencies in a 24-month period. This prototype will certainly not have all the coding maps included but will be sufficient for formal usability testing and interagency critique. A six-month testing and critique period would then enable a multiple-year (2-4) effort of implementation and continued investigation of information seeking strategies of people using statistical data.

5. This will require a large and distributed effort across several government agencies and a more modest but more focused development effort on the part of university and government researchers. Each agency would provide a representative to serve on a steering committee and act as liaison. Each agency in the outyears would also commit one or more technical staff to preparing and maintaining data. One or more core development teams of 4-6 staff would be responsible for developing and testing the prototype and conducting user studies. These staff should be drawn from both university and government and be supported with graduate research assistants. In the implementation years, more staff will be necessary to coordinate the multiple agency efforts at implementation.
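The mapping schemes for simple units discussed under question 2 could take the form of a crosswalk between program-specific codes and a standard place code. The sketch below is only an assumption about one possible shape: the scheme names, the local codes, and the standard codes are placeholders invented for illustration and do not correspond to real agency or FIPS codes.

```python
# Hypothetical crosswalk: (coding scheme, local code) -> standard place code.
# Every code shown here is an invented placeholder.
CROSSWALK = {
    ("survey_a", "X17"): "00001",
    ("survey_a", "X18"): "00001",
    ("survey_b", "0042"): "00001",
    ("survey_b", "0043"): "00002",
}


def to_standard(scheme, code):
    """Translate a program-specific code into the standard place code."""
    try:
        return CROSSWALK[(scheme, code)]
    except KeyError:
        raise LookupError(f"no mapping for {scheme} code {code!r}") from None


def local_codes(scheme, standard_code):
    """List every local code in one scheme that covers a standard place."""
    return sorted(
        code
        for (name, code), standard in CROSSWALK.items()
        if name == scheme and standard == standard_code
    )
```

The inverse lookup `local_codes` addresses the case described under question 2, where a user must otherwise enter several program-specific codes to cover one place: the interface could accept a single standard code and expand it automatically.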

Authors' qualifications and background.

Gary Marchionini, Professor of Information Science, University of Maryland at College Park College of Library and Information Services and Member of the Human-Computer Interaction Laboratory. http://www.glue.umd.edu/~march Information seeking strategies of people using electronic databases and the design of interfaces to support such strategies have been primary foci of Marchionini's research.

Stephan Greene, Faculty Research Assistant, University of Maryland at College Park Human-Computer Interaction Laboratory. http://www.cs.umd.edu/projects/hcil Developing techniques to link metadatabases to primary data was the research focus of Stephan Greene's thesis. He was one of the programmer analysts for the Great American History Machine that allows users to easily investigate US History through Census data.

Marchionini & Greene have been collaborating on assessing user information-seeking strategies for BLS, CPS, and OSS statistical sites to make recommendations for web site interface improvements. Both have developed DQ interfaces for various large-scale digital library efforts (State of Maryland Juvenile Justice Department; Library of Congress National Digital Library Program; Baltimore Learning Community).

References

Greene, S. (1996). The design and use of metadata for statistical social science databases. Master's Thesis, University of Maryland at College Park.

Greene, S. & Rose, A. (1996). Process change from user requirements elicitation: A case study of documents in a social services agency. Proceedings of IPIC '96, International Working Conference on Integration of Enterprise Information and Processes-"Rethinking Documents", Sloan School of Management, MIT, Cambridge, MA, November 14-15, 1996. Also University of Maryland CS-TR-3638, CAR-TR-827.

Greenstein, D. (1994). A historian's guide to computing. Oxford: Oxford U. Press.

Marchionini, G. (1995). Information seeking in electronic environments. NY: Cambridge U. Press.

Marchionini, G., Nolet, V., Williams, H., Ding, W., Rose, A., Beale, J., Enomoto, E., Harbinson, L., & Gordon, A. (in press). Content+Connectivity=Community: Digital Resources for a Learning Community. Proceedings of ACM International Conference on Digital Libraries (Philadelphia, July 24-26, 1997).

Michalewicz, Z. (Ed.) (1991). Statistical and scientific databases. Chichester, UK: Ellis Horwood Limited.

Plaisant, C., Marchionini, G., Bruns, T., Komlodi, A., & Campbell, L. (1996). Bringing Treasures to the Surface: Iterative design for the Library of Congress National Digital Library Program. To appear in CHI 97 Proceedings, Atlanta GA, 22-27 March 1997, ACM New York. Also University of Maryland CS-TR-3694.

Plaisant, C., Milash, B., Rose, A., Widoff, S., & Shneiderman, B. (1995). Life Lines: Visualizing personal histories. ACM CHI '96 Conference Proc. (Vancouver, BC, Canada, April 13-18, 1996) 221-227, color plate 518. Video abstract: ACM CHI '96 Conference Companion (Vancouver, BC, Canada, April 13-18, 1996) 392-393. Video available through ACM. Also University of Maryland CS-TR-3523, CAR-TR-787, ISR-TR-95-88.

Rafanelli, M. (1988). Research topics in statistical and scientific database management: The IV SSDBM. In Lecture notes in computer science 339 (Eds. M. Rafanelli, J. Klensin, & P. Svensson), 1-18. Berlin: Springer-Verlag.

Shneiderman, B. (1994). Dynamic queries for visual information seeking. IEEE Software, 11(6), 70-77.

Sowa, J. (1995). Knowledge representation: Logical, philosophical, and computational foundations. Boston: PWS Publishing (preprint).

Sundgren, B. (1996). Making statistical data more available. International Statistical Review, 64(1), 23-38.

Tanin, E., Beigel, R., & Shneiderman, B. (1996). Incremental data structures and algorithms for dynamic query interfaces. Workshop on New Paradigms in Information Visualization and Manipulation, Fifth ACM International Conference on Information and Knowledge Management CIKM '96 (Rockville, MD, Nov. 16, 1996), 12-15. Also University of Maryland CS-TR-3730.

Tukey, J. (1977). Exploratory data analysis. Reading, MA.: Addison-Wesley.