Apriori algorithm – Overview

Let’s consider how we may use the support measure to reduce the number of candidate itemsets that must be explored to generate frequent itemsets (i.e. where support ≥ minsup).

Theorem 1 (Apriori principle). If an itemset is frequent, then all of its subsets must also be frequent.

Consider the itemset lattice shown in Figure 1. Suppose {c,d,e} is a frequent itemset. Clearly, any transaction that contains {c,d,e} must also contain its subsets, {c,d}, {c,e}, {d,e}, {c}, {d}, and {e}. As a result, if {c,d,e} is frequent, then all subsets of {c,d,e} – the shaded itemsets in Figure 1 – must also be frequent.


Figure 1. An illustration of the Apriori principle.

Conversely, if an itemset {a,b} is infrequent, then all of its supersets must be infrequent too. In the itemset lattice, the entire subgraph containing the supersets of {a,b} can therefore be pruned immediately once {a,b} is found to be infrequent.

This strategy of trimming the exponential search space based on the support measure is known as support-based pruning. Such a pruning strategy is made possible by a key property of the support measure, namely, that the support for an itemset never exceeds the support for its subsets. This property is also known as the anti-monotone property of the support measure.

Definition 1 (Monotonicity property). Let I be the set of items, and let J = 2^I be the power set of I. A measure f is monotone (or upward closed) if

∀ X, Y ∈ J : (X ⊆ Y) → f(X) ≤ f(Y),

which means that if X is a subset of Y, then f(X) must not exceed f(Y).

On the other hand, f is anti-monotone (or downward closed) if

∀ X, Y ∈ J : (X ⊆ Y) → f(Y) ≤ f(X),

which means that if X is a subset of Y, then f(Y) must not exceed f(X).

Any measure that possesses an anti-monotone property can be incorporated directly into the mining algorithm to prune the search space of candidate itemsets.
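
As a quick, concrete check of the anti-monotone property, the following Python sketch computes support as the fraction of transactions containing an itemset and confirms that a superset never has higher support than its subset. The transactions and the itemsets X and Y are invented purely for this illustration.

```python
# Toy transactions, invented purely for this illustration.
transactions = [
    {"a", "b", "c"},
    {"a", "c", "d"},
    {"a", "b", "c", "e"},
    {"b", "e"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

X = {"a", "c"}        # X is a subset of Y below
Y = {"a", "c", "e"}

# Anti-monotonicity: because X ⊆ Y, support(Y) can never exceed support(X).
assert support(Y, transactions) <= support(X, transactions)
print(support(X, transactions), support(Y, transactions))  # 0.75 0.25
```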

Generating FIs using the Apriori algorithm

Apriori is the first Association Rule (AR) mining algorithm to use support-based pruning to address the exponential growth of candidate itemsets.

TID Items
1 {Bread, Milk}
2 {Bread, Diapers, Beer, Eggs}
3 {Milk, Diapers, Beer, Cola}
4 {Bread, Milk, Diapers, Beer}
5 {Bread, Milk, Diapers, Cola}

Table 1. An example of market basket transactions.

Figure 2 provides a high-level illustration of Apriori’s generation of FIs for the transactions shown in Table 1.

We assume that the support threshold is 60%, which is equivalent to a minimum support count equal to 3.

Figure 2. Illustration of frequent itemset generation using the Apriori algorithm.

In iteration 1, every item is considered as a candidate 1-itemset. After counting their supports, the candidate itemsets {Cola} and {Eggs} are discarded because they appear in fewer than three transactions.
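
This first pass can be sketched in a few lines of Python. It is a minimal sketch using the transactions from Table 1; variable names such as F1 and minsup_count are our own and not part of the algorithm's pseudocode.

```python
from collections import Counter

# Transactions from Table 1.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
minsup_count = 3  # 60% of 5 transactions

# Iteration 1: every item is a candidate 1-itemset; count and filter.
item_counts = Counter(item for t in transactions for item in t)
F1 = {item for item, count in item_counts.items() if count >= minsup_count}
print(sorted(F1))  # ['Beer', 'Bread', 'Diapers', 'Milk'] – Cola and Eggs are discarded
```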

In iteration 2, candidate 2-itemsets are generated using only the frequent 1-itemsets, because the Apriori principle ensures that all supersets of the infrequent 1-itemsets must be infrequent. Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets generated by the algorithm is

[latex]\binom{4}{2} = \frac{4!}{2!(4-2)!} = 6[/latex]

After counting their support values, two of these six candidate itemsets – {Beer, Bread} and {Beer, Milk} – turn out to be infrequent. The remaining four candidates are frequent and are used to generate candidate 3-itemsets in the next iteration.
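
Continuing the sketch above (it reuses transactions, F1 and minsup_count from the previous snippet), iteration 2 forms the six candidate 2-itemsets and counts their supports:

```python
from itertools import combinations

# Iteration 2: candidate 2-itemsets are pairs of frequent 1-itemsets.
C2 = [frozenset(pair) for pair in combinations(sorted(F1), 2)]
support_counts = {c: sum(c <= t for t in transactions) for c in C2}
F2 = {c for c, count in support_counts.items() if count >= minsup_count}

# {Beer, Bread} and {Beer, Milk} fall below the threshold; the other four survive.
print(sorted(tuple(sorted(c)) for c in F2))
```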

In iteration 3, candidate 3-itemsets are generated using only the four frequent 2-itemsets, because the Apriori principle ensures that all supersets of the infrequent 2-itemsets must be infrequent.

Without support-based pruning, there are

[latex]\binom{6}{3} = \frac{6!}{3!(6-3)!} = 20[/latex]

candidate 3-itemsets that can be formed using the six items. With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets are frequent. The only candidate that has this property is {Bread, Diapers, Milk}.
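
This subset check can be sketched as a continuation of the previous snippets (F2 is the set of frequent 2-itemsets computed above):

```python
from itertools import combinations

# Iteration 3: keep only 3-itemsets whose 2-item subsets are all frequent.
items = sorted(set().union(*F2))
C3 = [
    frozenset(triple)
    for triple in combinations(items, 3)
    if all(frozenset(pair) in F2 for pair in combinations(triple, 2))
]
print([tuple(sorted(c)) for c in C3])  # [('Bread', 'Diapers', 'Milk')]
```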

The efficiency of the Apriori pruning strategy can be shown by counting the number of candidate itemsets generated.

A brute-force strategy of enumerating all itemsets (up to size 3) as candidates will always generate

[latex]\binom{6}{1} +\binom{6}{2} +\binom{6}{3} = \frac{6!}{1!(6-1)!} +\frac{6!}{2!(6-2)!}+\frac{6!}{3!(6-3)!} = 6+15+20 = 41[/latex]

candidate itemsets.

The Apriori pruning strategy (applied to our data) produced

[latex]\binom{6}{1} +\binom{4}{2} + 1 = \frac{6!}{1!(6-1)!} +\frac{4!}{2!(4-2)!} + 1 = 6+6+1 = 13[/latex]

candidate itemsets – a 68% reduction in our case.
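
These counts are easy to verify with a one-off calculation (math.comb is the standard-library binomial coefficient):

```python
from math import comb

brute_force = comb(6, 1) + comb(6, 2) + comb(6, 3)   # 6 + 15 + 20 = 41
with_pruning = comb(6, 1) + comb(4, 2) + 1           # 6 + 6 + 1  = 13
print(brute_force, with_pruning, 1 - with_pruning / brute_force)  # 41 13 ≈ 0.68
```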

Algorithm 1 presents the pseudocode for the FI generation part of the Apriori algorithm. Let Ck denote the set of candidate k-itemsets and Fk denote the set of frequent k-itemsets:

  • The algorithm initially makes a single pass over the data set to determine the support of each item. Upon completion of this step, the set of all frequent 1-itemsets – F1 – will be known (steps 1 and 2).
  • The algorithm iteratively generates new candidate k-itemsets using the frequent (k-1)-itemsets found in the previous iteration (step 5). Candidate generation is implemented using a function called apriori-gen.
  • To count the support of the candidate itemsets, the algorithm needs to make an additional pass over the data set (steps 6 – 10). The subset function is used to determine all the candidate itemsets in Ck that are contained in each transaction t. The implementation of this function is described in YYY.
  • After counting their supports, the algorithm eliminates all candidate itemsets whose support counts are less than minsup (step 12).
  • The algorithm terminates when there are no new frequent itemsets generated, i.e. Fk = ∅ (step 13).


Algorithm 1. FI generation of the Apriori algorithm.
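
For readers who prefer runnable code to pseudocode, the sketch below implements the same level-wise loop in Python. It is a minimal, illustrative version only: the function apriori_gen here uses a simple join-and-prune strategy (union pairs of frequent (k-1)-itemsets and discard any candidate with an infrequent (k-1)-subset), whereas the actual candidate generation and pruning procedure of Apriori is discussed in the next section. Step numbers in the comments refer to the outline above.

```python
from itertools import combinations

def apriori_gen(F_prev, k):
    """One possible apriori-gen: join pairs of frequent (k-1)-itemsets and
    prune any candidate that has an infrequent (k-1)-subset."""
    F_prev = set(F_prev)
    candidates = set()
    for a in F_prev:
        for b in F_prev:
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in F_prev for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates

def apriori_frequent_itemsets(transactions, minsup_count):
    """Level-wise (generate-and-test) frequent itemset generation."""
    transactions = [set(t) for t in transactions]
    # Steps 1-2: a single pass over the data to find the frequent 1-itemsets F1.
    items = {item for t in transactions for item in t}
    F = {frozenset([i]) for i in items
         if sum(i in t for t in transactions) >= minsup_count}
    frequent = set()
    k = 1
    while F:                     # step 13: stop when no new frequent itemsets appear
        frequent |= F
        k += 1
        Ck = apriori_gen(F, k)   # step 5: generate candidate k-itemsets
        # Steps 6-10: an additional pass over the data to count candidate supports.
        counts = {c: sum(c <= t for t in transactions) for c in Ck}
        # Step 12: eliminate candidates whose support count is below minsup.
        F = {c for c, n in counts.items() if n >= minsup_count}
    return frequent

# Applied to the transactions of Table 1 with a minimum support count of 3:
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
for itemset in sorted(apriori_frequent_itemsets(transactions, 3), key=len):
    print(sorted(itemset))
```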

Characteristics of Apriori’s generation of FIs

The FI generation part of the Apriori algorithm has two important characteristics:

  1. It is a level-wise algorithm – it traverses the itemset lattice one level at a time, from frequent 1-itemsets to the maximum size of FIs.
  2. It employs a generate-and-test strategy for finding FIs – at each iteration, new candidate itemsets are generated from the FIs found in the previous iteration. The support for each candidate itemset is then counted and tested against minsup.

The total number of iterations needed by the Apriori algorithm is kmax+1, where kmax is the maximum size of the FIs.
