Exercise 1.2
Consider these documents:
- Doc 1: breakthrough drug for schizophrenia
- Doc 2: new schizophrenia drug
- Doc 3: new approach for treatment of schizophrenia
- Doc 4: new hopes for schizophrenia patients
a. Draw the term-document incidence matrix for this document collection.
b. Draw the inverted index representation for this collection, as in Figure 1.3 (page 7).
Answer 1.2
a. The term-document incidence matrix for this document collection is:
Term | Doc 1 | Doc 2 | Doc 3 | Doc 4 |
---|---|---|---|---|
approach | 0 | 0 | 1 | 0 |
breakthrough | 1 | 0 | 0 | 0 |
drug | 1 | 1 | 0 | 0 |
hopes | 0 | 0 | 0 | 1 |
new | 0 | 1 | 1 | 1 |
patients | 0 | 0 | 0 | 1 |
schizophrenia | 1 | 1 | 1 | 1 |
treatment | 0 | 0 | 1 | 0 |
b. The inverted index representation for this collection is:
Term | Postings |
---|---|
approach | Doc 3 |
breakthrough | Doc 1 |
drug | Doc 1, Doc 2 |
hopes | Doc 4 |
new | Doc 2, Doc 3, Doc 4 |
patients | Doc 4 |
schizophrenia | Doc 1, Doc 2, Doc 3, Doc 4 |
treatment | Doc 3 |
Exercise 1.3
For the document collection shown in Exercise 1.2, what are the returned results for these queries:
a. schizophrenia AND drug
b. for AND NOT(drug OR approach)
Answer 1.3
Query | Returned Documents |
---|---|
schizophrenia AND drug | Doc 1, Doc 2 |
for AND NOT(drug OR approach) | Doc 4 |
Exercise 1.7
Query Processing Order Recommendation:
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
Given the following postings list sizes:
Term | Postings size |
---|---|
eyes | 213312 |
kaleidoscope | 87009 |
marmalade | 107913 |
skies | 271658 |
tangerine | 46653 |
trees | 316812 |
Answer 1.7
To optimize the query processing order, we need to first calculate the OR operations with the smallest postings size.
kaleidoscope OR eyes
requires traversing 87009 + 213312 = 300321 nodes.marmalade OR skies
requires traversing 107913 + 271658 = 379571 nodes.tangerine OR trees
requires traversing 46653 + 316812 = 363465 nodes.
Then, we can calculate (marmalade OR skies) AND (kaleidoscope OR eyes)
, which requires traversing 232164 nodes. Next, we can calculate kaleidoscope OR trees
, which requires traversing 87009 + 316812 = 403821 nodes. Finally, we can calculate (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
, which requires traversing 213312 + 87009 + 232164 = 532485 nodes.
In summary, the recommended query processing order is:
kaleidoscope OR eyes
andmarmalade OR skies
(marmalade OR skies) AND (kaleidoscope OR eyes)
kaleidoscope OR trees
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
Exercise 1.11
How should the Boolean query x AND NOT y
be handled? Why is naive evaluation of this query normally very expensive? Write out a postings merge algorithm that evaluates this query efficiently.
Answer 1.11
The Boolean query x AND NOT y
means that we want all documents that contain term x
but do not contain term y
. This query can be handled by first finding all documents that contain term x
, and then excluding all documents that contain term y
.
Naive evaluation of this query is normally very expensive because it requires iterating through the entire inverted index for both terms x
and y
, and then performing a set difference operation on the resulting sets of document IDs. This can be very time-consuming for large indexes.
A more efficient algorithm for evaluating this query is as follows:
- Retrieve the postings list for term
x
and termy
. - Initialize two pointers, one for each postings list.
- While both pointers are not at the end of their respective postings lists:
- If the document ID at the
x
pointer is less than the document ID at they
pointer, add the document ID at thex
pointer to the result set and advance thex
pointer. - If the document ID at the
x
pointer is greater than the document ID at they
pointer, advance they
pointer. - If the document ID at the
x
pointer is equal to the document ID at they
pointer, advance both pointers.
- If the document ID at the
- Add any remaining document IDs from the
x
postings list to the result set.
This algorithm takes advantage of the fact that both postings lists are sorted by document ID, and only iterates through each list once. Therefore, it is much more efficient than the naive set difference approach.
Here is my Python implementation of the x AND NOT y
query algorithm:
def list_difference(a: list, b: list) -> list:
if not a:
return []
if not b:
return a
results = []
it_a, it_b = iter(a), iter(b)
i, j = next(it_a), next(it_b)
try:
while True:
if i < j:
results.append(i)
i = next(it_a)
elif i > j:
j = next(iter_b, None)
if j is None:
results.append(i)
break
else:
j = next(it_b)
i = next(it_a)
except StopIteration:
return results + list(it_a)
Python's implementation takes advantage of the iterator functionality, which is a Pythonic way of handling lists. Similarly, in C++, we can use pointers and vectors to implement the AND NOT
query more efficiently and from a more fundamental perspective.
Here is a C++ implementation using pointers and vectors:
#include <vector>
std::vector<int> and_not_query(const std::vector<int>& x, const std::vector<int>& y) {
std::vector<int> result;
auto ptr_x = x.begin(), ptr_y = y.begin();
while (ptr_x != x.end() && ptr_y != y.end()) {
if (*ptr_x < *ptr_y) {
result.push_back(*ptr_x);
++ptr_x;
} else if (*ptr_x > *ptr_y) {
++ptr_y;
} else {
++ptr_x;
++ptr_y;
}
}
result.insert(result.end(), ptr_x, x.end());
return result;
}
Proof about Time Complexity
The time complexity of the naive approach can be represented as:
$ T(n) = O(n^2) $
Where n is the length of the two lists.
In contrast, our solution has a time complexity represented as:
$ T(n) = O(n \log_2 n) $
Where n is the length of the two lists.
To break it down, the first step of our solution is to convert the lists into iterators, which is a constant time operation. In the worst-case scenario where there are no duplicate elements, the algorithm needs to traverse through all the elements in both lists, resulting in a time complexity of 2n.
The for loop in the algorithm has a time complexity represented as:
$ O(n) $
The while loop has a time complexity represented as:
$ O(\log_2 n) $
Thus, the overall time complexity of our solution is:
$ T(n) = O(n \log_2 n) $
In conclusion, our solution has a time complexity that is more efficient than the naive one.