Apriori algorithm – Generating candidate FIs

The apriori-gen function shown in Step 5 of Algorithm 1 generates candidate frequent itemsets by performing two operations:

  1. Candidate generation – This operation generates new candidate k-itemsets based on the frequent (k-1)-itemsets found in the previous iteration.
  2. Candidate pruning – This operation eliminates some of the candidate k-itemsets using the support-based pruning strategy.

To illustrate the candidate pruning operation, consider a candidate k-itemset, X = {i1, i2, i3, …, ik}.

The algorithm must determine whether all of its (k-1)-subsets, X – {ij} (j = 1, 2, 3, …, k), are frequent. If any of them is infrequent, then X is immediately pruned.

This approach can effectively reduce the number of candidate itemsets considered during support counting. The complexity of this operation is O(k) for each candidate k-itemset.

However, we do not have to examine all k subsets of a given candidate itemset. If m of the k subsets were used to generate a candidate, we only need to check the remaining k-m subsets during candidate pruning.
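
To make the pruning check concrete, the following Python sketch tests whether a candidate has an infrequent (k-1)-subset (the function name and the set-of-frozensets representation are our own; for simplicity it examines all k subsets rather than only the remaining k-m):

    from itertools import combinations

    def has_infrequent_subset(candidate, frequent_prev):
        # Return True if some (k-1)-subset X - {ij} of the candidate is
        # not among the frequent (k-1)-itemsets found so far.
        k = len(candidate)
        for subset in combinations(candidate, k - 1):
            if frozenset(subset) not in frequent_prev:
                return True  # candidate cannot be frequent; prune it
        return False

    # Example: {a, c} is infrequent, so candidate {a, b, c} is pruned.
    frequent_2 = {frozenset(s) for s in [("a", "b"), ("b", "c")]}
    print(has_infrequent_subset(("a", "b", "c"), frequent_2))  # True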

Requirements for a procedure to generate candidate itemsets

In principle, there are many ways to generate candidate itemsets. The following is a list of requirements for an effective candidate generation procedure:

  1. It should avoid generating too many unnecessary candidates – i.e. a candidate itemset is unnecessary if at least one of its subsets is infrequent. Such a candidate is guaranteed to be infrequent according to the anti-monotone property of support.
  2. It must ensure that the candidate set is complete – i.e. no frequent itemsets are left out by the candidate generation procedure. To ensure completeness, the set of candidate itemsets must subsume the set of all frequent itemsets, i.e. ∀k: Fk ⊆ Ck.
  3. It should not generate the same candidate itemset more than once. E.g. the candidate itemset {a,b,c,d} can be generated in many ways – by merging {a,b,c} with {d}, {b,d} with {a,c}, etc. Generation of duplicate candidates leads to wasted computation and should therefore be avoided for the sake of efficiency.

There are several candidate generation procedures – including the one used by the apriori-gen function.

Various procedures to generate candidate itemsets

Brute-force method

The brute-force method considers every k-itemset as a potential candidate and then applies the candidate pruning step to remove any unnecessary candidates (see Figure 1).

Figure 1. A brute-force method for generating candidate 3-itemsets.

The number of candidate itemsets generated at level k is

[latex]\binom{d}{k}[/latex]

where d is the total number of items. Although candidate generation is rather trivial, candidate pruning becomes extremely expensive because a large number of itemsets must be examined. Given that the amount of computation needed for each candidate is O(k), the overall complexity of this method is

[latex]O\left(\sum_{k=1}^{d} k \binom{d}{k}\right) = O(d \cdot 2^{d-1})[/latex]
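
As a rough illustration, here is a minimal Python sketch of the brute-force generation step (the function name is ours; every generated candidate would still have to go through the expensive pruning check):

    from itertools import combinations

    def brute_force_candidates(items, k):
        # Every k-itemset over the d items is a candidate: C(d, k) of them.
        return [frozenset(c) for c in combinations(sorted(items), k)]

    # With d = 6 items, level k = 3 already yields C(6, 3) = 20 candidates,
    # each of which must then be examined by the O(k) pruning step.
    items = ["Beer", "Bread", "Butter", "Diapers", "Eggs", "Milk"]
    print(len(brute_force_candidates(items, 3)))  # 20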

Fk-1 x Fk-1 method (used by apriori-gen)

The candidate generation procedure in the apriori-gen function merges a pair of frequent (k-1)-itemsets only if their first k-2 items are identical.

Let A = {a1, a2, a3, …, ak-1} and B = {b1, b2, b3, …, bk-1} be a pair of frequent (k-1)-itemsets. A and B are merged if they satisfy the following conditions:

ai = bi (for i = 1,2,3, …, k-2) and ak-1 ≠ bk-1

In Figure 2, the frequent itemsets {Bread, Diapers} and {Bread, Milk} are merged to form a candidate 3-itemset {Bread, Diapers, Milk}. The algorithm does not have to merge {Beer, Diapers} with {Diapers, Milk} because the first items of the two itemsets are different. Indeed, if {Beer, Diapers, Milk} were a viable candidate, it would have been obtained by merging {Beer, Diapers} with {Beer, Milk} instead. This example illustrates both the completeness of the candidate generation procedure and the advantage of using lexicographic ordering to prevent duplicate candidates. However, because each candidate is obtained by merging a pair of frequent (k-1)-itemsets, an additional candidate pruning step is needed to ensure that the remaining k-2 subsets of the candidate are frequent.

Figure 2. Generating and pruning candidate k-itemsets by merging pairs of frequent (k-1)-itemsets.
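
The following Python sketch is one way to implement this procedure (it follows our reading of the description above rather than the book's pseudocode; itemsets are represented as lexicographically sorted tuples):

    from itertools import combinations

    def generate_candidates_fk1_fk1(frequent_prev, k):
        # Merge pairs of frequent (k-1)-itemsets whose first k-2 items
        # agree, then prune candidates with an infrequent (k-1)-subset.
        prev = sorted(frequent_prev)             # sorted (k-1)-tuples
        prev_set = {frozenset(x) for x in prev}
        candidates = []
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                a, b = prev[i], prev[j]
                # Merge condition: ai = bi for i = 1..k-2, ak-1 != bk-1.
                # Since a < b lexicographically, a[-1] < b[-1], so the
                # merged candidate stays sorted.
                if a[:-1] == b[:-1] and a[-1] != b[-1]:
                    cand = a + (b[-1],)
                    # Candidate pruning: all (k-1)-subsets must be frequent.
                    if all(frozenset(s) in prev_set
                           for s in combinations(cand, k - 1)):
                        candidates.append(cand)
        return candidates

    # Frequent 2-itemsets from Figure 2, as sorted tuples.
    f2 = [("Beer", "Diapers"), ("Bread", "Diapers"),
          ("Bread", "Milk"), ("Diapers", "Milk")]
    print(generate_candidates_fk1_fk1(f2, 3))  # [('Bread', 'Diapers', 'Milk')]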

Fk-1 x F1 method (not elaborated here)

Support counting

Support counting is the process of determining the frequency of occurrence of every candidate itemset that survives the candidate pruning step of the apriori-gen function. Support counting is implemented in Steps 6 through 11 of Algorithm 1.

One approach is to compare each transaction against every candidate itemset and to update the support counts of candidates contained in the transaction (Figure 3). This approach is computationally expensive, especially when the numbers of transactions and candidate itemsets are large.

Figure 3. Counting the support of candidate itemsets.
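
A naive version of this approach might look as follows in Python (an illustrative sketch; it performs one containment test per transaction-candidate pair):

    def count_support_naive(transactions, candidates):
        # Compare every transaction against every candidate itemset.
        support = {c: 0 for c in candidates}
        for t in transactions:
            t = set(t)
            for c in candidates:
                if set(c) <= t:  # candidate contained in the transaction
                    support[c] += 1
        return support

    transactions = [{"Bread", "Milk", "Diapers"}, {"Bread", "Milk"},
                    {"Beer", "Diapers"}]
    candidates = [("Bread", "Milk"), ("Beer", "Milk")]
    print(count_support_naive(transactions, candidates))
    # {('Bread', 'Milk'): 2, ('Beer', 'Milk'): 0}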

An alternative is to enumerate the itemsets contained in each transaction and use them to update the support counts of their respective candidate itemsets.

To illustrate, consider a transaction t that contains five items, {1,2,3,4,5}. There are

[latex]\binom{5}{3} = \frac{5!}{3!(5-3)!} = 10[/latex]

itemsets of size 3 contained in this transaction. Some of the itemsets may correspond to the candidate 3-itemsets under investigation, in which case their support counts are incremented. Other subsets of t that do not correspond to any candidates can be ignored.
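
This enumeration-based approach can be sketched in Python as follows (assuming, purely for illustration, that the candidates are kept in a set of frozensets so that each enumerated subset can be matched in constant time):

    from itertools import combinations

    def count_support_by_enumeration(transactions, candidates, k):
        # Enumerate the k-subsets of each transaction and increment the
        # counts of those that match a candidate; the rest are ignored.
        candidate_set = {frozenset(c) for c in candidates}
        support = {frozenset(c): 0 for c in candidates}
        for t in transactions:
            for subset in combinations(sorted(t), k):
                fs = frozenset(subset)
                if fs in candidate_set:
                    support[fs] += 1
        return support

    transactions = [{1, 2, 3, 4, 5}]  # C(5, 3) = 10 subsets enumerated
    candidates = [(1, 2, 3), (2, 4, 6)]
    print(count_support_by_enumeration(transactions, candidates, 3))
    # {frozenset({1, 2, 3}): 1, frozenset({2, 4, 6}): 0}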

Figure 4 shows a systematic way of enumerating the 3-itemsets contained in t. Assuming that each itemset keeps its items in increasing lexicographic order, an itemset can be enumerated by specifying the smallest item first, followed by the larger items. For instance, given t = {1,2,3,5,6}, all the 3-itemsets contained in t must begin with item 1, 2, or 3.

Figure 4. Enumerating subsets of three items from a transaction t.

The number of ways to specify the first item of a 3-itemset contained in t is illustrated by the Level 1 prefix structures depicted in Figure 4. After fixing the first item, the prefix structures at Level 2 represent the number of ways to select the second item. Finally, the prefix structures at Level 3 represent the complete set of 3-itemsets contained in t.

The prefix structures shown in Figure 4 demonstrate how itemsets contained in a transaction can be systematically enumerated – i.e. by specifying their items one by one, from the leftmost item to the rightmost item.

We still have to determine whether each enumerated 3-itemset corresponds to an existing candidate itemset.

Support counting using a hash tree

In the Apriori algorithm, candidate itemsets are partitioned into different buckets and stored in a hash tree. During support counting, itemsets contained in each transaction are also hashed into their appropriate buckets. That way, instead of comparing each itemset in the transaction with every candidate itemset, it is matched only against candidate itemsets that belong to the same bucket (Figure 5).

Figure 5. Counting the support of itemsets using a hash structure.
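
A full hash tree hashes on one item per tree level and stores candidates only at its leaves; building one is beyond the scope of this section. As a rough illustration of the bucket idea alone, the sketch below uses a flat hash table instead (our own simplification, not the actual hash tree structure):

    from itertools import combinations

    def build_buckets(candidates, num_buckets=7):
        # Partition candidate itemsets into buckets; a flat stand-in
        # for the leaf nodes of a real hash tree.
        buckets = {}
        for c in candidates:
            key = hash(frozenset(c)) % num_buckets
            buckets.setdefault(key, []).append(frozenset(c))
        return buckets

    def count_support_hashed(transactions, buckets, k, num_buckets=7):
        # Hash each k-subset of a transaction to a bucket and compare
        # it only against the candidates stored in that bucket.
        support = {c: 0 for bucket in buckets.values() for c in bucket}
        for t in transactions:
            for subset in combinations(sorted(t), k):
                fs = frozenset(subset)
                for c in buckets.get(hash(fs) % num_buckets, []):
                    if c == fs:
                        support[c] += 1
        return support

    candidates = [(1, 2, 3), (1, 4, 5), (2, 3, 4), (3, 5, 6)]
    buckets = build_buckets(candidates)
    print(count_support_hashed([{1, 2, 3, 5, 6}], buckets, 3))
    # {1, 2, 3} and {3, 5, 6} each end up with a support count of 1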

We do not elaborate on this procedure here.

Nor do we elaborate on the computational complexity of the Apriori algorithm.
