Boolean Retrieval Homework

Exercise 1.2

Consider these documents:

Doc 1: breakthrough drug for schizophrenia
Doc 2: new schizophrenia drug
Doc 3: new approach for treatment of schizophrenia
Doc 4: new hopes for schizophrenia patients

a. Draw the term-document incidence matrix for this document collection.

b. Draw the inverted index representation for this collection, as in Figure 1.3 (page 7).

Answer 1.2

a. The term-document incidence matrix for this document collection is:

Term	Doc 1	Doc 2	Doc 3	Doc 4
approach	0	0	1	0
breakthrough	1	0	0	0
drug	1	1	0	0
hopes	0	0	0	1
new	0	1	1	1
patients	0	0	0	1
schizophrenia	1	1	1	1
treatment	0	0	1	0

b. The inverted index representation for this collection is:

Term	Postings
approach	Doc 3
breakthrough	Doc 1
drug	Doc 1, Doc 2
hopes	Doc 4
new	Doc 2, Doc 3, Doc 4
patients	Doc 4
schizophrenia	Doc 1, Doc 2, Doc 3, Doc 4
treatment	Doc 3

Exercise 1.3

For the document collection shown in Exercise 1.2, what are the returned results for these queries:

a. schizophrenia AND drug

b. for AND NOT(drug OR approach)

Answer 1.3

Query	Returned Documents
`schizophrenia AND drug`	Doc 1, Doc 2
`for AND NOT(drug OR approach)`	Doc 4

Exercise 1.7

Query Processing Order Recommendation:

(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

Given the following postings list sizes:

Term	Postings size
eyes	213312
kaleidoscope	87009
marmalade	107913
skies	271658
tangerine	46653
trees	316812

Answer 1.7

To optimize the query processing order, we need to first calculate the OR operations with the smallest postings size.

kaleidoscope OR eyes requires traversing 87009 + 213312 = 300321 nodes.
marmalade OR skies requires traversing 107913 + 271658 = 379571 nodes.
tangerine OR trees requires traversing 46653 + 316812 = 363465 nodes.

Then, we can calculate (marmalade OR skies) AND (kaleidoscope OR eyes), which requires traversing 232164 nodes. Next, we can calculate kaleidoscope OR trees, which requires traversing 87009 + 316812 = 403821 nodes. Finally, we can calculate (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes), which requires traversing 213312 + 87009 + 232164 = 532485 nodes.

In summary, the recommended query processing order is:

kaleidoscope OR eyes and marmalade OR skies
(marmalade OR skies) AND (kaleidoscope OR eyes)
kaleidoscope OR trees
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

Exercise 1.11

How should the Boolean query x AND NOT y be handled? Why is naive evaluation of this query normally very expensive? Write out a postings merge algorithm that evaluates this query efficiently.

Answer 1.11

The Boolean query x AND NOT y means that we want all documents that contain term x but do not contain term y. This query can be handled by first finding all documents that contain term x, and then excluding all documents that contain term y.

Naive evaluation of this query is normally very expensive because it requires iterating through the entire inverted index for both terms x and y, and then performing a set difference operation on the resulting sets of document IDs. This can be very time-consuming for large indexes.

A more efficient algorithm for evaluating this query is as follows:

Retrieve the postings list for term x and term y.
Initialize two pointers, one for each postings list.
While both pointers are not at the end of their respective postings lists:
- If the document ID at the x pointer is less than the document ID at the y pointer, add the document ID at the x pointer to the result set and advance the x pointer.
- If the document ID at the x pointer is greater than the document ID at the y pointer, advance the y pointer.
- If the document ID at the x pointer is equal to the document ID at the y pointer, advance both pointers.
Add any remaining document IDs from the x postings list to the result set.

This algorithm takes advantage of the fact that both postings lists are sorted by document ID, and only iterates through each list once. Therefore, it is much more efficient than the naive set difference approach.

Here is my Python implementation of the x AND NOT y query algorithm:

def list_difference(a: list, b: list) -> list:
    if not a:
        return []
    if not b:
        return a

    results = []
    it_a, it_b = iter(a), iter(b)
    i, j = next(it_a), next(it_b)

    try:
        while True:
            if i < j:
                results.append(i)
                i = next(it_a)
            elif i > j:
                j = next(iter_b, None)
                if j is None:
                    results.append(i)
                    break
            else:
                j = next(it_b)
                i = next(it_a)
    except StopIteration:
        return results + list(it_a)

Python's implementation takes advantage of the iterator functionality, which is a Pythonic way of handling lists. Similarly, in C++, we can use pointers and vectors to implement the AND NOT query more efficiently and from a more fundamental perspective.

Here is a C++ implementation using pointers and vectors:

#include <vector>

std::vector<int> and_not_query(const std::vector<int>& x, const std::vector<int>& y) {
    std::vector<int> result;
    auto ptr_x = x.begin(), ptr_y = y.begin();

    while (ptr_x != x.end() && ptr_y != y.end()) {
        if (*ptr_x < *ptr_y) {
            result.push_back(*ptr_x);
            ++ptr_x;
        } else if (*ptr_x > *ptr_y) {
            ++ptr_y;
        } else {
            ++ptr_x;
            ++ptr_y;
        }
    }

    result.insert(result.end(), ptr_x, x.end());
    return result;
}

Proof about Time Complexity
The time complexity of the naive approach can be represented as:
$ T(n) = O(n^2) $
Where n is the length of the two lists.
In contrast, our solution has a time complexity represented as:
$ T(n) = O(n \log_2 n) $
Where n is the length of the two lists.
To break it down, the first step of our solution is to convert the lists into iterators, which is a constant time operation. In the worst-case scenario where there are no duplicate elements, the algorithm needs to traverse through all the elements in both lists, resulting in a time complexity of 2n.
The for loop in the algorithm has a time complexity represented as:
$ O(n) $
The while loop has a time complexity represented as:
$ O(\log_2 n) $
Thus, the overall time complexity of our solution is:
$ T(n) = O(n \log_2 n) $
In conclusion, our solution has a time complexity that is more efficient than the naive one.