Association Analysis – Basic Concepts and Algorithms

Problem Definition

Market basket data can be represented in a binary format as shown in Table 1, where each row corresponds to a transaction and each column corresponds to an item. An item can be treated as a binary variable whose value is one if the item is present in a transaction and zero otherwise. Because the presence of an item in a transaction is often considered more important than its absence, an item is an asymmetric binary variable.

         Designer
UserID   D1  D2  D3  D4
U1       1   0   0   0
U2       0   0   1   1
U3       0   0   0   0
U4       0   0   1   0
U5       0   0   0   0
U6       0   0   0   0
U7       0   0   0   0
U8       1   0   1   1

Table 1. A binary 0/1 representation of market basket data.
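
As a rough illustration of the binary representation, the sketch below rebuilds the rows of Table 1 from a per-user set of visited designer sites. The names ITEMS, visits, and to_binary_row are made up for this example and do not come from Chapter 6.

```python
# Columns of Table 1 (illustrative short names for "Visit to Site of DesignerN").
ITEMS = ["D1", "D2", "D3", "D4"]

# One entry per user: the set of designer sites that user visited (from Table 1).
visits = {
    "U1": {"D1"},
    "U2": {"D3", "D4"},
    "U3": set(),
    "U4": {"D3"},
    "U5": set(),
    "U6": set(),
    "U7": set(),
    "U8": {"D1", "D3", "D4"},
}


def to_binary_row(items_visited, all_items=ITEMS):
    """Return a 0/1 list: 1 where the item is present, 0 otherwise."""
    return [1 if item in items_visited else 0 for item in all_items]


# Binary (0/1) representation: one row per user, one column per item.
binary_table = {user: to_binary_row(items) for user, items in visits.items()}

for user, row in binary_table.items():
    print(user, *row)
```

Storing each row as a set of items and deriving the 0/1 row on demand is only one possible design; it keeps sparse data compact, since most entries in Table 1 are zero.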

This representation is perhaps a very simplistic view of real market basket data because it ignores certain important aspects of the data such as the quantity of items sold or the price paid to purchase them.

[In your case, you’ll eventually be concerned with whether the visit to the website resulted in a conversion, the value of the purchase, etc.]

In General: Itemset, Transaction, and Transaction Width

Let I = {i1, i2, …, id} be the set of all items in a market basket data set and T = {t1, t2, …, tN} be the set of all transactions. Each transaction ti contains a subset of items chosen from I. In association analysis, a collection of zero or more items is termed an itemset. If an itemset contains k items, it is called a k-itemset. For instance, {Beer, Diapers, Milk} is an example of a 3-itemset. The null (or empty) set, ∅, is an itemset that does not contain any items.

The transaction width is defined as the number of items present in a transaction. A transaction tj is said to contain an itemset X if X is a subset of tj. For example, the second transaction shown in Table 6.2 contains the itemset {Bread, Diapers} but not {Bread, Milk}.

In Your Case: Itemset, Transaction, and Transaction Width

In your case, items correspond to “Visit to Site of Designer1”, “Visit to Site of Designer2”, “Visit to Site of Designer3”, etc.

So the set of all items is I = {Visit to Site of Designer1, Visit to Site of Designer2, Visit to Site of Designer3, …}.

A 2-itemset is, for example, {Visit to Site of Designer1, Visit to Site of Designer2}; a 3-itemset is, for example, {Visit to Site of Designer1, Visit to Site of Designer2, Visit to Site of Designer5}; and so on. The null (or empty) itemset, ∅, does not contain a Visit to the Site of any Designer.

In your case, a transaction corresponds to User1, User2, etc.

This is a potential source of confusion for you: "transaction" suggests an action (haha) or some kind of event. Eventually you may have a sufficiently large sample of data to move in this direction; you'll then be able to redefine "transaction" to mean something like "a User's visit to the Company's website", i.e. one Transaction = one Session. An itemset will then become the Sites of Designers that the User visits in the course of that Transaction/Session. A User may then contribute multiple transactions (multiple rows in the spreadsheet) within the period of observation. For now, you don't have this kind of "transaction"; you only have one entry (one row) per User. We could maybe choose a better word than "transaction" in your case, but for the sake of aligning with the language in Chapter 6, we've kept "transaction".

In your case, the transaction width corresponds to the Number of Sites of different Designers that a given User has visited.

Again, watch the potential confusion - a more usual understanding of transaction width would correspond to the Number of Sites of different Designers that a given User has visited within a given Session.

A transaction tj is said to contain an itemset X, if X is a subset of tj. For example, the second transaction (corresponding to User2) shown in Table 1 contains the itemset {Visit to Site of Designer3, Visit to Site of Designer4} but not {Visit to Site of Designer2, Visit to Site of Designer4}.
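
To make the definitions of transaction width and containment concrete, here is a minimal sketch over the data in Table 1. It assumes each transaction (User) is stored as a Python set of short item labels (D1 to D4); the function names transaction_width and contains are hypothetical.

```python
# Table 1, one set of visited designer sites per user (i.e. per transaction).
transactions = {
    "U1": {"D1"},
    "U2": {"D3", "D4"},
    "U3": set(),
    "U4": {"D3"},
    "U5": set(),
    "U6": set(),
    "U7": set(),
    "U8": {"D1", "D3", "D4"},
}


def transaction_width(transaction):
    """Number of items present in the transaction."""
    return len(transaction)


def contains(transaction, itemset):
    """True if itemset X is a subset of transaction tj."""
    return set(itemset) <= set(transaction)


print(transaction_width(transactions["U2"]))        # 2
print(contains(transactions["U2"], {"D3", "D4"}))   # True
print(contains(transactions["U2"], {"D2", "D4"}))   # False
print(contains(transactions["U3"], set()))          # True: the empty itemset is in every transaction
```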

Support Count

An important property of an itemset is its support count, which refers to the number of transactions that contain a particular itemset.

Mathematically, the support count, σ(X), for an itemset X can be stated as follows:

σ(X) = |{ti | X ⊆ ti, ti ∈ T}|     (Ch 6, Equation 1)

where the symbol | · | denotes the number of elements in a set.

In the data set shown in Table 1, the support count for {Visit to Site of Designer3, Visit to Site of Designer4} is equal to two because there are only two transactions that contain both items, i.e. only two Users (User2 and User8) visited the sites of both Designers.
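
The support count can be sketched in a few lines of Python. This is only an illustration: the list transactions mirrors Table 1 with short labels (D1 to D4), and the function name support_count is an assumption of this example.

```python
# Table 1 as a list of transactions (Users U1..U8), each a set of items.
transactions = [
    {"D1"},              # U1
    {"D3", "D4"},        # U2
    set(),               # U3
    {"D3"},              # U4
    set(),               # U5
    set(),               # U6
    set(),               # U7
    {"D1", "D3", "D4"},  # U8
]


def support_count(itemset, transactions):
    """Count the transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


print(support_count({"D3", "D4"}, transactions))  # 2 (U2 and U8)
print(support_count({"D1"}, transactions))        # 2 (U1 and U8)
print(support_count(set(), transactions))         # 8 (empty itemset is in every transaction)
```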

Association Rule, Support, and Confidence

An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅.

The strength of an association rule can be measured in terms of its support and confidence.

Support determines how often a rule is applicable to a given data set.

Confidence determines how frequently items in Y appear in transactions that contain X. The formal definitions of these metrics are:

Support, s(X → Y) = σ(X ∪ Y) / N     (Ch 6, Equation 2)
Confidence, c(X → Y) = σ(X ∪ Y) / σ(X)     (Ch 6, Equation 3)

where N is the total number of transactions.

Example

Consider the association rule {Visit to Site of Designer3, Visit to Site of Designer4} → {Visit to Site of Designer1}.

Since the support count for {Visit to Site of Designer3, Visit to Site of Designer4, Visit to Site of Designer1} is 1 and the total number of transactions (i.e. Users) is 8, support for the association rule is 1/8 = 0.125.

Confidence in the association rule is obtained by dividing the support count for {Visit to Site of Designer3, Visit to Site of Designer4, Visit to Site of Designer1} by the support count for {Visit to Site of Designer3, Visit to Site of Designer4}. There are 2 transactions that contain Visit to Site of Designer3 and Visit to Site of Designer4 – so the confidence in the association rule {Visit to Site of Designer3, Visit to Site of Designer4} → {Visit to Site of Designer1} is 1/8 divided by 2/8, or 1/2 = 0.5.
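
The arithmetic of this example can be checked with a short sketch. It assumes the transactions of Table 1 are stored as Python sets with short labels (D1 to D4); the function names support_count, support, and confidence are chosen for this illustration only.

```python
# Table 1 as a list of transactions (Users U1..U8), each a set of items.
transactions = [
    {"D1"},              # U1
    {"D3", "D4"},        # U2
    set(),               # U3
    {"D3"},              # U4
    set(),               # U5
    set(),               # U6
    set(),               # U7
    {"D1", "D3", "D4"},  # U8
]


def support_count(itemset, transactions):
    """Count the transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(X, Y, transactions):
    """Support of the rule X -> Y: support count of X union Y, divided by N."""
    return support_count(set(X) | set(Y), transactions) / len(transactions)


def confidence(X, Y, transactions):
    """Confidence of the rule X -> Y: support count of X union Y over that of X."""
    count_x = support_count(X, transactions)
    if count_x == 0:
        raise ValueError("confidence is undefined when no transaction contains X")
    return support_count(set(X) | set(Y), transactions) / count_x


X, Y = {"D3", "D4"}, {"D1"}
print(support(X, Y, transactions))     # 0.125
print(confidence(X, Y, transactions))  # 0.5
```

Raising an error when no transaction contains X is one way to handle the undefined case; returning a sentinel value would be an equally reasonable choice.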

Why Use Support and Confidence?

Support is an important measure because an association rule that has very low support may occur simply by chance, i.e. randomly. A low support rule is also likely to be uninteresting from a business perspective because it may not be profitable to promote items that customers seldom buy together (with the exception discussed later). For these reasons, support is often used to eliminate uninteresting rules. As will be shown later, support also has a desirable property that can be exploited for the efficient discovery of association rules.

Confidence, on the other hand, measures the reliability of the inference made by an association rule. For a given rule X → Y, the higher the confidence, the more likely it is for Y to be present in transactions that contain X. Confidence also provides an estimate of the conditional probability of Y given X.

Association analysis results should be interpreted with caution. The inference made by an association rule does not necessarily imply causality. Instead, it suggests a strong co-occurrence relationship, i.e. correlation, between items in the antecedent and consequent of the rule.
