
Comparing Retrieval or Search Engines


 
Table 1: Q values for full retrieval for the 6 different retrieval engines.

  i   Engine        Q_i (queries 1-100)   Q_i (queries 1-50)   Q_i (queries 51-100)
  1   Boolean              1.000                 1.000                 1.000
  2   Freestyle1           0.871                 0.868                 0.873
  3   Freestyle2           1.000                 1.000                 1.000
  4   Freestyle3           0.913                 0.919                 0.907
  5   Target1              0.744                 0.751                 0.738
  6   Target2              0.956                 0.927                 0.983

The data in Table 1 are the Q values obtained when full retrieval is used, with no cutoffs. The values of 1.000 for Boolean and Freestyle2 indicate that these engines appear to perform optimally under this condition, while Target1, with a Q value of 0.744, is the weakest performer. The two rightmost columns of Table 1 give the Q values computed separately over the first and the second 50 queries; the closeness of these figures to the overall values suggests that the Q values are relatively robust. The quality, or difficulty, associated with retrieving documents for each individual query is given by the query-specific A values in Table 2. Interestingly, some queries (e.g. queries 9 through 11) show that relevant documents are easily moved to the front of the ranked list, while the A values for other queries, such as 8 and 24, represent situations where it is far more difficult to discriminate between relevant and non-relevant documents.
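To make the split-half comparison in the right-hand columns of Table 1 concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that per-query quality scores in [0, 1] are available and that an engine is summarized by their simple mean; the paper's actual aggregation into Q may differ, and the scores shown are hypothetical.

    # Minimal sketch of the split-half check behind the right-hand columns
    # of Table 1. Assumption (not from the paper): an engine is summarized
    # here by the simple mean of hypothetical per-query scores in [0, 1].

    def mean(values):
        """Arithmetic mean of a sequence of scores."""
        return sum(values) / len(values)

    def split_half_summary(per_query_scores):
        """Return (overall, queries 1-50, queries 51-100) means."""
        first_half, second_half = per_query_scores[:50], per_query_scores[50:]
        return mean(per_query_scores), mean(first_half), mean(second_half)

    # Hypothetical per-query scores for one engine (100 values).
    scores = [0.93] * 50 + [0.90] * 50

    overall, q_first, q_second = split_half_summary(scores)
    print(f"queries 1-100: {overall:.3f}   1-50: {q_first:.3f}   51-100: {q_second:.3f}")

Close agreement between the two half-collection figures, as seen for the engines in Table 1, is what is meant above by the Q values being relatively robust.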
 
Table 2: The set of A values obtained with retrieval of all documents.

  Query i   A_i     Query i   A_i     Query i   A_i     Query i   A_i
     1     .020       26     .075       51     .345       76     .000
     2     .500       27     .048       52     .000       77     .087
     3     .087       28     .000       53     .000       78     .199
     4     .064       29     .240       54     .044       79     .080
     5     .102       30     .000       55     .187       80     .002
     6     .027       31     .074       56     .000       81     .104
     7     .122       32     .023       57     .076       82     .106
     8     .326       33     .109       58     .151       83     .165
     9     .000       34     .142       59     .279       84     .000
    10     .000       35     .106       60     .094       85     .123
    11     .000       36     .022       61     .047       86     .153
    12     .054       37     .141       62     .226       87     .101
    13     .196       38     .200       63     .169       88     .000
    14     .092       39     .118       64     .045       89     .061
    15     .147       40     .248       65     .155       90     .000
    16     .220       41     .012       66     .107       91     .346
    17     .034       42     .169       67     .162       92     .068
    18     .060       43     .123       68     .324       93     .056
    19     .101       44     .150       69     .000       94     .194
    20     .043       45     .083       70     .000       95     .075
    21     .000       46     .035       71     .000       96     .061
    22     .149       47     .106       72     .032       97     .000
    23     .228       48     .053       73     .000       98     .089
    24     .318       49     .026       74     .055       99     .000
    25     .132       50     .069       75     .000      100     .500

These numbers are easily interpreted by noting that about 1000 documents are being considered for retrieval. Multiplying each A value by 1000 gives the expected position of a relevant document when the ranking is optimal. For query 1, A_1 = .020, suggesting that the average position of a relevant document would be at about the 20th document retrieved; for a query such as 24, where A_24 = .318, the average position of a relevant document would be at about the 318th document retrieved. Clearly, most searchers who want high recall will find an A value over .1 unacceptable for this dataset. The ease of interpreting a measure such as A is one reason for using it: holding A constant, e.g. A = .001, one can immediately see the practical impact for the searcher of retrieving these documents from a database of a thousand, a million, or hundreds of millions of documents.
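The short Python sketch below works through this arithmetic with a few A values taken from Table 2, using N = 1000 as the approximate collection size, and then repeats the fixed-A scaling argument for larger databases.

    # Expected rank of a relevant document under optimal ranking: A * N.
    N = 1000  # approximate number of documents considered for retrieval

    a_values = {1: 0.020, 24: 0.318, 100: 0.500}  # sample A values from Table 2
    for query, a in a_values.items():
        print(f"Query {query}: A = {a:.3f} -> expected rank about {a * N:.0f}")

    # Holding A fixed (e.g. A = .001) while the database grows shows the
    # practical cost to the searcher at different collection sizes.
    for n in (1_000, 1_000_000, 100_000_000):
        print(f"N = {n:>11,}: expected rank about {0.001 * n:,.0f}")

With A fixed at .001, the expected rank of a relevant document moves from about the 1st document in a thousand-document collection to about the 100,000th in a collection of one hundred million, which is the practical impact described above.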
Bob Losee
1999-07-29