INLS 461
Information Tools

Professor: Serena Fenton
School of Information and Library Science at UNC-Chapel Hill

Assignment: UNIX 2 - Tools

The purpose of this assignment is to allow practice with some typical UNIX tools in a simple data analysis task.

The product of this assignment will be a Wordpad .RTF document containing various text and screenshots as noted below.
Submit through Blackboard.

This assignment also uses the data from the SyskillWebert.tar.gz file.

  • Extracting from this file results in a directory SW that contains four subdirectories: Bands, BioMedical, Goats, and Sheep.
  • Each of those subdirectories contains a number of files with names made up of digits and sometimes a "-" character, and one more file named index that contains information about the other files in that subdirectory.
  • Each index file contains one line of text for each of the other files in its subdirectory, with the following format:
    file-name | rating | url | date-rated | title
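  • Here rating is one of "hot", "medium", or "cold". (Note: the spaces around the "|" characters in the format line above are shown only for readability. In the actual files there are no spaces between the "|" characters and the fields they separate, and the other information does not itself contain any "|" characters.)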
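
To make the record format concrete, here is a small sketch that splits one index-style line into its fields with cut. The record itself is invented for illustration (the file name, URL, date, and title are not taken from the real data); only the field order and the "|" delimiter come from the description above.

```shell
# Hypothetical sample record in the index format (all values invented).
record='12345|hot|http://example.com/page|Mon Jan 1 1996|Example Title'

# cut splits on the "|" delimiter; the single quotes keep the shell from
# treating "|" as a pipe.
rating=$(printf '%s\n' "$record" | cut -d'|' -f2)
title=$(printf '%s\n' "$record" | cut -d'|' -f5)

echo "rating: $rating"
echo "title: $title"
```

Because the fields never contain a "|" themselves, the field numbers stay stable from line to line.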

Open Wordpad and start your document to record the results of this assignment. Remember to save this document as an .rtf file.

Log on to UNIX using your isis account. You can either:

  • Create a new directory specifically for this Project, and extract the SyskillWebert.tar.gz data into the new directory.
  • Or use the same directory you prepared for Project 2.
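
If you choose the first option, the steps might look roughly like the sketch below. It is only an illustration: the directory name project3 and the stand-in archive demo.tar.gz are assumptions made so the example runs anywhere, and with the real data you would use SyskillWebert.tar.gz from wherever you saved it. The gunzip-into-tar form is used because not every tar accepts a -z option.

```shell
# Scratch area with a tiny stand-in archive (invented contents) so the
# sketch is self-contained; skip this part when using the real archive.
work=$(mktemp -d)
cd "$work"
mkdir -p SW/Goats
printf '1|hot|http://example.com/|date|title\n' > SW/Goats/index
tar -cf - SW | gzip > demo.tar.gz
rm -r SW

# The actual steps: make a project directory (hypothetical name) and
# unpack the archive into it.
mkdir project3
cd project3
gunzip -c ../demo.tar.gz | tar -xf -

# List the extracted top-level directory to confirm the result.
ls SW
```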

In your SSH window, at the UNIX command line:

  • Use a single UNIX wc command, possibly with wildcard(s) in its argument(s), to determine the number of entries (lines) in each of the index files, and the total number of those entries in all of the index files.
  • Use a single UNIX grep command, possibly with wildcard(s) in its argument(s), to identify all of the "medium" rated entries in all directories. (You should find 11 of these.)
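
As a hedged illustration of how wildcards work with wc and grep, the sketch below builds a tiny stand-in for the SW tree and runs one command of each kind against it. The directory contents are invented, so the counts it prints are not the counts you should expect from the real data, and this is one possible shape for the commands rather than the only acceptable one.

```shell
# Toy stand-in for the SW tree (invented entries, not the real data).
demo=$(mktemp -d)
mkdir -p "$demo/SW/Goats" "$demo/SW/Sheep"
printf '%s\n' '1|hot|http://a.example/|d|t' \
              '2|medium|http://b.example/|d|t' > "$demo/SW/Goats/index"
printf '%s\n' '3|cold|http://c.example/|d|t' > "$demo/SW/Sheep/index"
cd "$demo"

# The * wildcard expands to every subdirectory, so a single wc call
# reports a line count per index file followed by a total.
wc -l SW/*/index

# Given several files, grep prefixes each matching line with its file name.
# The quoted "|" characters tie the match to the rating field.
grep '|medium|' SW/*/index
```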

Take a screenshot of your SSH window at this point, and place it in your Wordpad document as evidence that you have done this step. Make sure that the text in this screenshot is legible, and shows both the commands and the output for the wc and grep commands above.

In your SSH window, still at the UNIX command line:

  • Use one or more UNIX grep commands, pipelined into a UNIX wc command, to count the total number of "hot" rated pages in all directories.
    (Warning: one of the entries contains a "hotwired.com" URL. Be sure not to count this as a "hot" rated page unless it really is "hot" rated.)
  • Your answer should be 93. If any of your options or arguments contain characters with special meaning to the shell, you will have to quote those characters (or the whole option or argument that contains them) so that the shell passes them to your command(s) unchanged.
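
The "hotwired.com" warning and the quoting note can be seen together in a small sketch. The entries below are invented, including a "cold" entry whose URL contains "hot", so the counts are for this toy data only and say nothing about the real total.

```shell
# Toy data (invented): two "hot" entries and one "cold" entry whose URL
# happens to contain the string "hot".
demo=$(mktemp -d)
mkdir -p "$demo/SW/Bands" "$demo/SW/Goats"
printf '%s\n' '10|hot|http://a.example/|d|t' \
              '11|cold|http://www.hotwired.com/|d|t' > "$demo/SW/Bands/index"
printf '%s\n' '20|hot|http://b.example/|d|t' > "$demo/SW/Goats/index"
cd "$demo"

# Loose pattern: matches "hot" anywhere on the line, including the URL.
loose=$(grep 'hot' SW/*/index | wc -l | tr -d ' ')

# Strict pattern: the quoted "|" delimiters restrict the match to the
# rating field. Without the quotes the shell would read "|" as a pipe.
strict=$(grep '|hot|' SW/*/index | wc -l | tr -d ' ')

echo "loose: $loose  strict: $strict"
```

On this toy data the loose pattern overcounts by one, which is the kind of error the warning above is about.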

Take another screenshot of your SSH window at this point, and place it in your Wordpad document as evidence that you have done this step. Make sure that the text in this screenshot is legible, and shows both the grep-to-wc pipeline and its output.

Grading:

  • (10 points) Correct wc command to count all index entries.
  • (10 points) Correct grep command to identify all "medium" rated index entries.
  • (5 points) Legible screenshot of the results above.
  • (20 points) Correct grep-to-wc pipeline.
  • (5 points) Legible screenshot of the pipeline results.

revised September 1, 2006