Oracle Data Mining Concepts 10g Release 1 (10.1) Part Number B10698-01


This chapter describes the predictive models, that is, the supervised learning functions. These functions predict a target value. The Oracle Data Mining Java interface supports the following predictive functions and associated algorithms:
Function                            Algorithm
Classification (Section 3.1)        Naive Bayes (Section 3.1.3)
                                    Adaptive Bayes Network (Section 3.1.4)
                                    Support Vector Machine (Section 3.1.6)
Regression (Section 3.2)            Support Vector Machine (Section 3.2.1)
Attribute Importance (Section 3.3)  Minimum Description Length (Section 3.3.1)
This chapter also describes ODM Model Seeker (Section 3.4), which builds several Naive Bayes and Adaptive Bayes Network models and selects the best one.
In a classification problem, you typically have historical data (labeled examples) and unlabeled examples. Each labeled example consists of multiple predictor attributes and one target attribute (dependent variable). The value of the target attribute is a class label. The unlabeled examples consist of the predictor attributes only. The goal of classification is to construct a model using the historical data that accurately predicts the label (class) of the unlabeled examples.
A classification task begins with build data (also known as training data) for which the target values (or class assignments) are known. Different classification algorithms use different techniques for finding relations between the predictor attributes' values and the target attribute's values in the build data. These relations are summarized in a model, which can then be applied to new cases with unknown target values to predict target values. A classification model can also be used on build data with known target values, to compare the predictions to the known answers; such data is also known as test data or evaluation data. This technique is called testing a model, which measures the model's predictive accuracy. The application of a classification model to new data is called applying the model, and the data is called apply data or scoring data. Applying a model to data is often called scoring the data.
Classification is used in customer segmentation, business modeling, credit analysis, and many other applications. For example, a credit card company may wish to predict which customers will default on their payments. Each customer corresponds to a case; data for each case might consist of a number of attributes that describe the customer's spending habits, income, demographic attributes, etc. These are the predictor attributes. The target attribute indicates whether or not the customer has defaulted; that is, there are two possible classes, corresponding to having defaulted or not. The build data is used to build a model that you then use to predict, for new cases, whether these new customers are likely to default.
In a classification problem, it may be important to specify the costs involved in making an incorrect decision. Doing so can be useful when the costs of different misclassifications vary significantly.
For example, suppose the problem is to predict whether a user will respond to a promotional mailing. The target has two categories: YES (the customer responds) and NO (the customer does not respond). Suppose a positive response to the promotion generates $500 and that it costs $5 to do the mailing. If the model predicts YES and the actual value is YES, the cost of misclassification is $0. If the model predicts YES and the actual value is NO, the cost of misclassification is $5. If the model predicts NO and the actual value is YES, the cost of misclassification is $500. If the model predicts NO and the actual value is NO, the cost is $0.
The row indexes of a cost matrix correspond to actual values; the column indexes correspond to predicted values. For any pair of actual/predicted indexes, the value indicates the cost of misclassification.
Classification algorithms apply the cost matrix to the predicted probabilities during scoring to estimate the least expensive prediction. If a cost matrix is specified for apply, the output of the scoring run is prediction and cost, rather than prediction and probability.
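The mechanics of cost-sensitive scoring can be sketched as follows, using the promotional-mailing figures above. This is an illustrative sketch, not ODM's API; the function name and data layout are hypothetical.

```python
# Hypothetical illustration of cost-sensitive scoring; not the ODM API.
# Cost matrix keys are (actual, predicted) pairs, matching the row/column
# convention described above: rows are actual values, columns are predictions.
COST = {
    ("YES", "YES"): 0, ("YES", "NO"): 500,  # actual YES: missing a responder costs $500
    ("NO", "YES"): 5, ("NO", "NO"): 0,      # actual NO: a wasted mailing costs $5
}

def min_cost_prediction(probs):
    """probs: dict mapping each class to its predicted probability for one case.
    Returns (prediction, expected_cost) for the least expensive prediction."""
    classes = list(probs)
    def expected_cost(predicted):
        return sum(probs[actual] * COST[(actual, predicted)] for actual in classes)
    best = min(classes, key=expected_cost)
    return best, expected_cost(best)

# Even a 10% chance of a response makes mailing worthwhile:
# cost(YES) = 0.9 * $5 = $4.50, while cost(NO) = 0.1 * $500 = $50.
print(min_cost_prediction({"YES": 0.1, "NO": 0.9}))  # ('YES', 4.5)
```

Note that the minimum-cost prediction (YES) differs from the maximum-probability prediction (NO); this is exactly the effect a cost matrix is meant to have.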
In building a classification model, priors can be useful when the training data does not accurately reflect the real underlying population. A priors vector informs the model of the true distribution of target classes in the underlying population. The model build adjusts its predicted probabilities (for Adaptive Bayes Network and Naive Bayes) or its relative complexity factor (for Support Vector Machine) accordingly.
The Naive Bayes algorithm (NB) can be used for both binary and multiclass classification problems to answer questions such as "Which customers will switch to a competitor? Which transaction patterns suggest fraud? Which prospects will respond to an advertising campaign?" For example, suppose a bank wants to promote its mortgage offering to its current customers and that, to reduce promotion costs, it wants to target the most likely prospects. The bank has historical data for its customers, including income, number of household members, money-market holdings, and information on whether a customer has recently obtained a mortgage through the bank. Using NB, the bank can predict how likely a customer is to respond positively to a mortgage offering. With this information, the bank can reduce its promotion costs by restricting the promotion to the most likely candidates.
NB affords fast model building and scoring for relatively low volumes of data.
NB makes predictions using Bayes' Theorem, which derives the probability of a prediction from the underlying evidence. Bayes' Theorem states:
P(A | B) = P(B | A) * P(A) / P(B)
That is, the probability of event A occurring given that event B has occurred is equal to the probability of event B occurring given that event A has occurred, multiplied by the probability of event A occurring and divided by the probability of event B occurring.
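A small numeric check makes the theorem concrete. The scenario and numbers below are hypothetical, chosen only to illustrate the calculation:

```python
# Numeric check of Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B).
# Hypothetical scenario: A = "customer defaults", B = "customer missed a payment".
p_a = 0.02          # P(A): prior probability of default
p_b_given_a = 0.90  # P(B|A): most defaulters missed a payment first
p_b = 0.10          # P(B): overall rate of missed payments

p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # approximately 0.18: the evidence raises the probability 9x
```

The evidence (a missed payment) multiplies the prior probability of default by P(B|A)/P(B) = 9, which is the sense in which the theorem "derives the probability of a prediction from the underlying evidence."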
NB assumes that each attribute is conditionally independent of the others: given a particular value of the target, the distribution of each predictor is independent of the other predictors.
In practice, this assumption, even when violated, does not degrade the model's predictive accuracy significantly, and makes the difference between a fast, computationally feasible algorithm and an intractable one.
Naive Bayes lets you, using cross-validation, test model accuracy on the same data that was used to build the model, rather than building the model on one portion of the data and testing it on a different portion. Not having to hold aside a portion of the data for testing is especially useful if the amount of build data is relatively small.
"Leave-one-out cross-validation" is a special case of cross-validation in which one record is left out of the build data when building a model. The number of models built equals the number of records (omitting a different build record for each model), which makes this procedure computationally expensive. With Naive Bayes models, however, the approach can be modified such that all build records are used for building a single model. Then, the model is repeatedly modified to quickly remove the effects of one build record, incrementally "unbuilding" the model for that record, as though that record had been omitted when building the model in the first place. The accuracy of the prediction for each build record can then be assessed against the model that would have been built from all the build records except that one, without having had to actually build a separate model for each build record.
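The incremental "unbuilding" idea can be sketched with a count-based Naive Bayes model: the counts are accumulated once, and each record's own contribution is subtracted before that record is scored. This is an illustrative simplification, not ODM's implementation; the smoothing scheme is an assumption.

```python
from collections import Counter

# Sketch of leave-one-out cross-validation for a count-based Naive Bayes model.
# Build the counts once, then "unbuild" one record at a time by subtracting its
# contribution before scoring it. Add-one smoothing (an assumption, not ODM's
# scheme) avoids zero counts.
def loo_accuracy(records, targets):
    classes = sorted(set(targets))
    class_counts = Counter(targets)
    # joint_counts[(attr_index, value, cls)] = count in the full build data
    joint_counts = Counter()
    for rec, cls in zip(records, targets):
        for i, v in enumerate(rec):
            joint_counts[(i, v, cls)] += 1

    correct = 0
    for rec, actual in zip(records, targets):
        scores = {}
        for cls in classes:
            # Remove this record's own contribution to the counts.
            held_out = 1 if cls == actual else 0
            n_cls = class_counts[cls] - held_out
            score = n_cls + 1  # smoothed prior weight
            for i, v in enumerate(rec):
                joint = joint_counts[(i, v, cls)] - held_out
                score *= (joint + 1) / (n_cls + 2)  # smoothed P(value | class)
            scores[cls] = score
        if max(scores, key=scores.get) == actual:
            correct += 1
    return correct / len(records)
```

Because only a handful of counts change per record, each "unbuild" is cheap, which is why the procedure avoids the cost of building one model per record.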
To use Naive Bayes cross-validation, the user executes a CrossValidate task object, specifying that a Naive Bayes model is to be tested. The execution of the cross-validate task creates a ClassificationTestResult associated with classification test metrics.
See Table 3-1, below, for a comparison of the main characteristics of ABN and NB.
Adaptive Bayes Network (ABN) is an Oracle proprietary algorithm that provides a fast, scalable, nonparametric means of extracting predictive information from data with respect to a target attribute. (Nonparametric statistical techniques avoid assuming that the population is characterized by a family of simple distributional models, such as standard linear regression, where different members of the family are differentiated by a small set of parameters.)
ABN, in single feature build mode, can describe the model in the form of human-understandable rules. The rules produced by ABN are one of its main advantages over Naive Bayes. The business user, marketing professional, or business analyst can understand the basis of the model's predictions and can therefore be comfortable acting on them and explaining them to others.
In addition to explanatory rules, ABN provides performance and scalability via a collection of user parameters that control the trade-off between accuracy and build time.
ABN predicts binary as well as multiclass targets. Binary targets are those that take on only two values, for example, buy and not buy. Multiclass targets have more than two values, for example, products purchased (product A or product B or product C). Multiclass target values are not assumed to exist in an ordered relation to each other, for example, hair brush is not assumed to be greater or less than comb.
ABN can use costs and priors for both building and scoring (see Section 3.1.1 and Section 3.1.2).
An ABN model is an adaptive conditional independence model that uses the minimum description length principle to construct and prune an array of conditionally independent Network Features. Each Network Feature consists of one or more Conditional Probability Expressions. The collection of Network Features forms a product model that provides estimates of the target class probabilities. There can be one or more Network Features. The number and depth of the Network Features in the model determine the model mode. There are three model modes for ABN: pruned Naive Bayes, single feature (a simplified decision tree), and boosted.
Users can select the ABN model type. Rules are available only for Single Feature Build; see Section 3.1.4.2 for information about rules.
Each Network Feature consists of one or more attributes included in a Conditional Probability Expression. An array of single attribute Network Features is an MDL-pruned Naive Bayes model. A single multi-attribute Network Feature model is equivalent to a simplified C4.5 decision tree; such a model is simplified in the sense that numerical attributes are binned and treated as categorical. Furthermore, a single predictor is used to split all nodes at a given tree depth. The splits are k-way, where k is the number of unique (binned) values of the splitting predictor. Finally, a collection of multi-attribute Network Features forms a product model (boosted mode). All three types provide estimates of the target class probabilities.
Rules can be extracted from the Adaptive Bayes Network Model as Compound Predicates. Rules form a human-interpretable depiction of the model and include statistics indicating the number of the relevant training data instances in support of the rule. A record apply instance specifies a pathway in a network feature taking the form of a compound predicate.
For example, suppose the feature consists of two training attributes: Age {20-40, 40-60, 60-80} and Income {<=50K, >50K}. A record instance consisting of a person age 25 and income $42K is expressed as
IF AGE IN (20-40) and INCOME IN (<=50K)
Suppose that the associated target (for example, response to a promotion) probabilities are {0.8 (no), 0.2 (yes)}. Then we have a detailed rule of the form
IF AGE IN (20-40) and INCOME IN (<=50K) => Prob = {0.8, 0.2}
In addition to the probability distribution, there are the associated training data counts, e.g. {400, 100}. Suppose there is a cost matrix specifying that it is 6 times more costly to incorrectly predict a no than to incorrectly predict a yes. Then the cost of predicting yes for this instance is 0.8 * 1 = 0.8 (because the model is wrong in this prediction 80% of the time) and the cost of predicting no is 0.2 * 6 = 1.2. Thus, the minimum cost (best) prediction is yes.
Without the cost matrix, the decision is reversed: implicitly, all errors are equal, and we have 0.8 * 1 = 0.8 for yes and 0.2 * 1 = 0.2 for no.
The order of the predicates in the generated rules implies relative importance.
When you apply an ABN model for which rules were generated, with a single feature, you get the same result that you would get if you wrote an external program that applied the rules.
To control the execution time of a build, ABN provides the following usersettable parameters:
Note that only the single feature mode results in rules.
When you specify MaxBuildTime for a boosted mode ABN model, the ABN build terminates in one of the following states:
The algorithm outputs its current model state and statistics that provide an estimate of how long it would take for the model to build (and prune) a feature.
Table 3-1 compares the main characteristics of Adaptive Bayes Network and Naive Bayes.
Support Vector Machine (SVM) is a classification and regression prediction tool that uses machine learning theory to maximize predictive accuracy while automatically avoiding overfit to the data.
Neural networks (NN) and radial basis functions (RBFs), both popular data mining techniques, can be viewed as special cases of SVMs.
SVMs perform well with real-world applications such as classifying text, recognizing handwritten characters, classifying images, as well as bioinformatics and biosequence analysis. Their introduction in the early 1990s led to an explosion of applications and deepening theoretical analysis that established SVM along with neural networks as one of the standard tools for machine learning and data mining.
There is no upper limit on the number of attributes and target cardinality for SVMs.
The SVM kernel functions supported at this release are linear and Gaussian.
You can influence both the Support Vector Machine (SVM) model quality (accuracy) and performance (build time) through two basic mechanisms: data preparation and model settings. Significant performance degradation can be caused by a poor choice of settings or inappropriate data preparation. Poor settings choices can also lead to inaccurate models. By default, ODM has built-in mechanisms that attempt to choose appropriate settings for the problem at hand. ODM provides data normalization and explosion of categorical fields for data preparation. You may need to override the defaults or do your own data preparation for some domains. For detailed information about data preparation for SVM models, see the Oracle Data Mining Application Developer's Guide.
SVM uses z-score or min-max normalization. With z-score normalization, the transformed data for each attribute has a mean of 0 and a standard deviation of 1; values can extend beyond the range -1 to +1, and there is no special treatment for sparse data.
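The z-score transformation described above can be sketched in a few lines. This is an illustrative implementation, not ODM's transformation code:

```python
# Minimal sketch of z-score normalization: each attribute is rescaled to
# mean 0 and standard deviation 1, and values may fall outside [-1, +1].
# (Illustrative only; not ODM's data preparation code.)
def z_score(values):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

# Hypothetical income attribute: the extremes land beyond +/-1.22.
print(z_score([30_000, 50_000, 70_000]))
```

Normalization matters for SVM because attributes on very different scales (income in dollars versus household size) would otherwise dominate the kernel computation.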
Regression creates predictive models. The difference between regression and classification is that regression deals with numerical/continuous target attributes, whereas classification deals with discrete/categorical target attributes. In other words, if the target attribute contains continuous (floating-point) values, a regression technique is required. If the target attribute contains categorical (string or discrete integer) values, a classification technique is called for.
The most common form of regression is linear regression, in which a line that best fits the data is calculated, that is, the line that minimizes the sum of the squared vertical distances of the points from the line.
This line becomes a predictive model when the value of the dependent variable is not known; its value is predicted by the point on the line that corresponds to the values of the independent variables for that record.
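The fit-then-predict cycle described above can be sketched with ordinary least squares for a single predictor. The data is hypothetical and the function name is illustrative:

```python
# Least-squares fit of a line y = a + b*x, minimizing the summed squared
# vertical distances of the points from the line. Hypothetical data.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Build data: years of experience vs. salary (in $1000s).
a, b = fit_line([1, 2, 3, 4], [40, 50, 58, 72])
# Apply: predict the target for a new record with x = 5.
print(round(a + b * 5))  # 81
```

The prediction for the new record is simply the point on the fitted line at that record's predictor value, exactly as the paragraph above describes.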
Support Vector Machine (SVM) builds both classification and regression models. SVM is described in the section on classification; see Section 3.1.6, "Support Vector Machine".
Attribute Importance (AI) provides an automated solution for improving the speed and possibly the accuracy of classification models built on data tables with a large number of attributes.
Attribute Importance ranks the predictive attributes by eliminating redundant, irrelevant, or uninformative attributes and identifying those predictor attributes that may have the most influence in making predictions. ODM examines data and constructs classification models that can be used to make predictions about subsequent data. The time required to build these models increases with the number of predictors. Attribute Importance helps a user identify a proper subset of these attributes that are most relevant to predicting the target. Model building can proceed using the selected attributes (predictor attributes) only.
Using fewer attributes decreases model building time, although sometimes at a cost in predictive accuracy. Using too many attributes (especially those that are "noise") can affect the model and degrade its performance and accuracy. By extracting as much information as possible from a given data table using the smallest number of attributes, a user can save significant computing time and often build better models.
Attribute Importance permits the user to specify a number or percentage of attributes to use; alternatively the user can specify a cutoff point. After an Attribute Importance model is built, the user can select the subset of attributes based on the ranking or the predictive value.
Attribute Importance can be applied to data tables with a very large set of attributes. However, the DBA may have to tune the database in various ways to ensure that Attribute Importance build executes efficiently. For example, it is important to ensure that there is adequate swap space and table space.
Minimum Description Length (MDL) is an information theoretic model selection principle. MDL assumes that the simplest, most compact representation of data is the best and most probable explanation of it.
With MDL, the model selection problem is treated as a communication problem. There is a sender, a receiver, and data to be transmitted. For classification models, the data to be transmitted is a model and the sequence of target class values in the training data. Typically, each model under consideration comes from a list of potential candidate models. The sizes of the lists vary. The models compute probabilities for the target class of each training data set row. The model's predicted probability is used to encode the training data set target values. From Shannon's noiseless coding theorem, it is known that the most compact encoding uses a number of bits equal to -log2(p_i), where p_i is the probability of the target value in the i-th training data set row, summed over all rows in the data set.
The sender and receiver agree on lists of potential candidate models for each model under consideration. A model in a list is represented by its position in the list (for example, model #26). The position is expressed as a binary number; thus, for a list of length m, the number of bits in the binary number is log2(m). The sender and receiver each get a copy of the list. The sender transmits the model. Once the model is known, the sender knows how to encode the targets and the receiver knows how to decode them. The sender then transmits the encoded targets. The total description length of a model is then:

log2(m) - Sum_i(log2(p_i))
Note that the better the model is at estimating the target probabilities, the shorter the description length. However, the improvement in target probability estimation can come at the expense of a longer list of potential candidates. Thus the description length is a penalized likelihood of the model.
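The penalized-likelihood trade-off can be illustrated numerically. The candidate-list sizes and probabilities below are hypothetical:

```python
from math import log2

# Sketch of the total description length: log2(m) bits to name the model in a
# candidate list of length m, plus -log2(p_i) bits to encode each training
# row's target using the model's predicted probability for it.
def description_length(m, target_probs):
    return log2(m) + sum(-log2(p) for p in target_probs)

# A sharper model (higher probabilities on the true targets) drawn from a
# longer candidate list can still win on total description length:
simple = description_length(m=2, target_probs=[0.6] * 20)
sharp = description_length(m=16, target_probs=[0.9] * 20)
print(simple > sharp)  # True
```

Here the sharper model pays 3 extra bits for its longer list (log2(16) versus log2(2)) but saves far more on encoding the 20 targets, so its penalized likelihood is better.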
The attribute importance problem can be put in the MDL framework by considering each attribute as a simple predictive model of the target class. Each value of the predictor (indexed by i) has associated with it a set of n_i training examples and a probability distribution p_ij over the m target class values (indexed by j). From n_i training examples there are at most n_i distinguishable values of p_ij in the data. From combinatorics, it is known that the number of distributions of n objects among m distinct categories is C(n + m - 1, m - 1). Hence the size of the list for a predictor is:

Sum_i log2(C(n_i + m - 1, m - 1))
The total description length for a predictor is then:

Sum_i log2(C(n_i + m - 1, m - 1)) - Sum_ij(log2(p_ij))
The predictor rank is the position in the list of associated description lengths, smallest first.
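The per-predictor calculation can be sketched from class counts. This is an illustrative sketch, not ODM's implementation; it assumes the stars-and-bars count C(n_i + m - 1, m - 1) for the list-size term and empirical frequencies for p_ij:

```python
from math import comb, log2

# Sketch of a predictor's description length: for each predictor value i with
# n_i training rows over m target classes, add log2(C(n_i + m - 1, m - 1))
# bits for the candidate list plus -log2(p_ij) bits per row for encoding.
# counts[i][j] = number of rows with predictor value i and target class j.
def predictor_description_length(counts):
    m = len(counts[0])
    total = 0.0
    for row in counts:
        n_i = sum(row)
        total += log2(comb(n_i + m - 1, m - 1))    # list-size term
        for n_ij in row:
            if n_ij:
                total += -n_ij * log2(n_ij / n_i)  # encoding term
    return total

# A predictor that separates the two classes well gets a shorter description
# length, and therefore a better (smaller) rank:
informative = predictor_description_length([[9, 1], [1, 9]])
uninformative = predictor_description_length([[5, 5], [5, 5]])
print(informative < uninformative)  # True
```

Ranking the predictors by these description lengths, smallest first, gives the attribute importance ordering described above.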
ODM Model Seeker allows a user to build multiple ABN and NB models and select the "best" model based on a predetermined criterion.
Note: Model Seeker is deprecated in Oracle Data Mining 10g Release 1 (10.1). It will not be available in subsequent releases of ODM; the functionality will be supported in other ways.
With Model Seeker, the user can compactly specify parameters for an execution that will asynchronously build and test multiple classification models. These models may be built using different algorithms. For each algorithm, Model Seeker systematically varies the algorithm parameters to create a series of parameter values that are used to create corresponding models. Model Seeker then evaluates these models and selects a "best" model using a predetermined criterion.
These features were developed to support applications that want to create a model that can be directly used in a production application environment with a minimum of user interaction. This assumes that the application developer has the expertise, or advice from someone with the expertise, to set the various algorithm parameters in a manner that is suitable to the particular business problem and data sources on which the application is to operate.
A secondary purpose is to present to the user or application a summary of information about each model built and tested by a Model Seeker execution so that the user or application can independently find the parameters that correspond to an alternative "best" model using a different criterion. This feature is meant to provide support for an application for use by an analyst who will experiment with alternative parameter settings to discover the appropriate settings to use in a production environment.
Model Seeker's criterion for the "best" model is the one with the largest value for a calculated weighted relative accuracy. The weight used in this calculation is the relative importance of the positive target category to the other categories treated as a single negative category. If the weight is set to 1.0, the positive target category relative accuracy has the same weight as the relative accuracy of all the other categories combined.
The following formula is used to calculate the figure of merit (FOM) for the "best" model, where FOM is the weighted sum of the positive target relative accuracy and the total negative relative accuracy:
FOM = W * (number of correct positives) / ((W + 1) * (number of actual positives))
      + (number of correct negatives) / ((W + 1) * (number of actual negatives))
where W is the user-specified weight, a value that must be > 0.0. A recommended way for a user to choose a value for the weight is as follows. First, estimate the cost to the user of predictions that are all correct except for a random fixed percentage, say 5%, of positive predictions being false positives. Second, estimate the cost to the user of predictions that are all correct except for the same random fixed percentage, 5%, of negative predictions being false negatives. Use a value for weight equal to the ratio of the second estimate to the first estimate. A weight of 1.0 means that a given percentage of false negatives has the same cost as the same percentage of false positives.
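The FOM formula can be checked with hypothetical test counts. This is an illustrative sketch of the formula above, not Model Seeker code:

```python
# Sketch of the figure of merit: a weighted sum of the positive and negative
# relative accuracies. Counts are hypothetical; not Model Seeker code.
def figure_of_merit(w, correct_pos, actual_pos, correct_neg, actual_neg):
    return (w * correct_pos / ((w + 1) * actual_pos)
            + correct_neg / ((w + 1) * actual_neg))

# With w = 1.0 both class accuracies count equally, so FOM is their average:
# 80% positive accuracy and 90% negative accuracy average to 0.85.
print(figure_of_merit(1.0, 80, 100, 90, 100))  # approximately 0.85
```

Raising W above 1.0 shifts the "best" model selection toward models that classify the positive target category more accurately, which matches the weight-selection advice above.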