What's New in CART 5.0?

10/03/2004

CART 5.0 incorporates many new user-requested enhancements and features. The most intuitive decision-tree interface is now even easier to use, and the underlying algorithm is more versatile and powerful than ever.

Discrete Variables

Discrete variables (a.k.a. categorical variables) are those that take on a finite set of distinct values. Predictor variables can be discrete, as can the target variable (in which case the model is a classification tree).

In CART 4, discrete variables were called "categorical." They were required to be numeric and had to take on a contiguous series of whole-number values, e.g., 1, 2, 3, 4, … If the data were naturally in text form (e.g., "Republican," "Democrat," "Libertarian," "Independent") or took on distinct but fractional numeric values (e.g., 1.1, 1.2, 1.5, 2.0), the user was required to recode the variable into a contiguous, whole-number form. In addition, if the target variable was categorical the user was required to specify in advance - prior to building the model - the number of target classes it took on. This placed a burden on the analyst to become familiar enough with a classification target to know how many distinct levels it took on in the data sample being analysed. Changing the sample, perhaps by including additional data or excluding certain records, or simply by changing the partitioning of learn and test samples, could change the number of learn sample target classes - requiring the user to "recount" the target classes before proceeding with a new model.

CART 5 handles discrete variables in the following more flexible and easier-to-use ways:

  • Ability to Automatically Detect Distinct Classes

    It is no longer necessary to identify how many distinct classes a discrete variable has, even if the variable is the target. Thus, the user has only to identify which variables are to be treated as discrete; CART will figure out the rest.

  • Fractional Values Can Be Specified as Categorical

    Numeric discrete variables no longer need to take on whole-number values, nor do the values need to be contiguous. For instance, the following series of distinct values is supported in CART 5: 0.01, 0.1, 1.0, 1.001, 200, -500.

  • Character Data

    Character variables are now fully supported, as predictors, as the target or as auxiliary variables (described below). This is an important new feature because modern data, especially those arising from web logs and Internet transactions, are often character in nature.
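The automatic class detection described above can be illustrated with a small sketch. This is not CART's API or internal code, just a hypothetical re-creation of the idea: the user declares a variable discrete, and the distinct levels (fractional, non-contiguous, or character) are found automatically.

```python
# Hypothetical sketch (not CART's API): automatically detecting the
# distinct classes of a discrete variable, as CART 5 now does internally.
def distinct_classes(values):
    """Return the sorted distinct levels of a discrete variable."""
    return sorted(set(values), key=str)

# Fractional, non-contiguous numeric levels are acceptable in CART 5 ...
numeric_levels = distinct_classes([1.1, 1.2, 1.5, 2.0, 1.1, 1.5])

# ... and so are character levels; no recoding or pre-counting needed.
party_levels = distinct_classes(["Republican", "Democrat", "Libertarian",
                                 "Independent", "Democrat"])
```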

Native Support for Text Data

CART 5 includes native support for text datasets, the most flexible and natural format for many users to maintain data. A single delimiter is used throughout the dataset, usually a comma, but semicolon, space, and tab are also supported as delimiters. (See Chapter 3: Reading Data; Reading ASCII files.)
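The kind of single-delimiter text file described here can be sketched with Python's standard csv module; this illustrates the file format, not CART's own reader, and the column names are invented for the example.

```python
# Illustrative sketch of a delimited text dataset of the kind CART 5
# reads natively: one delimiter (comma, semicolon, space, or tab)
# used consistently throughout the file.
import csv
import io

raw = "PARTY,AGE,INCOME\nDemocrat,34,52000\nRepublican,51,78000\n"

# csv.Sniffer can detect which of the candidate delimiters is in use.
dialect = csv.Sniffer().sniff(raw, delimiters=",; \t")
rows = list(csv.reader(io.StringIO(raw), dialect))
header, records = rows[0], rows[1:]
```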

Data Information

The Data Info window is a new display in CART 5 that offers summary information about the variables in your dataset. Included are continuous statistics (N, mean, sum, min, max, variance, standard deviation, skewness, kurtosis, conditional mean, and counts of values equal and unequal to 0.0), all of which may be weighted by a case weight variable. Also available are a fully-weighted tabulation of distinct values, along with quantiles, quartiles and the interquartile range, and the N most- and least-frequent values.
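The statistics listed above are standard moment-based summaries. As a minimal, unweighted sketch (the Data Info window can additionally weight them by a case weight variable), independent of CART itself:

```python
# A minimal sketch of the kind of summary statistics the Data Info
# window reports, computed here unweighted with standard formulas.
import math

def data_info(x):
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n   # variance
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    sd = math.sqrt(m2)
    return {
        "N": n, "sum": sum(x), "mean": mean,
        "min": min(x), "max": max(x),
        "variance": m2, "std dev": sd,
        "skewness": m3 / sd ** 3,
        "kurtosis": m4 / m2 ** 2,
        "N == 0": sum(1 for v in x if v == 0.0),
        "N != 0": sum(1 for v in x if v != 0.0),
    }

stats = data_info([0.0, 1.0, 2.0, 3.0, 4.0])
```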

Auxiliary Variables

CART 5 introduces the "auxiliary" variable. Any variable (discrete/continuous, character/numeric) can be summarised with descriptive statistics or a frequency distribution at any node level. Such variables are termed "auxiliary" variables. It is not necessary for auxiliary variables to be predictors in the model, although they can be. For example, profit and revenue measures in the dataset can be summarised for each node without affecting the growth of the tree (i.e., they are not predictors in the model), allowing the most profitable partitions to be easily identified.
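The profit example above can be sketched in a few lines. This is a hypothetical illustration of the idea, not CART's code: an auxiliary variable is summarised within each terminal node, while the tree itself is grown without it.

```python
# Hypothetical sketch of an auxiliary-variable summary: PROFIT is
# averaged within each terminal node without being used to grow the
# tree. Node ids and profit values are invented for illustration.
from collections import defaultdict

# (terminal_node_id, profit) pairs for each case in the learn sample
cases = [(1, 120.0), (1, 80.0), (2, -15.0), (2, 5.0), (2, 40.0)]

profit_by_node = defaultdict(list)
for node, profit in cases:
    profit_by_node[node].append(profit)

node_mean_profit = {node: sum(p) / len(p)
                    for node, p in profit_by_node.items()}

# The most profitable partition is then easy to identify.
best_node = max(node_mean_profit, key=node_mean_profit.get)
```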

Groves

CART 5 introduces "grove" files, which replace the pre-CART 5 .TR1 file. A grove file is a binary file that stores all the information about the tree sequence needed to apply any tree from the sequence to new data, or to translate (export) the tree into a different presentation language. Grove files contain a variety of information, including node information, the optimal tree indicator, and predicted probabilities. Grove files are not limited to storing only one tree sequence, but may contain entire collections of trees obtained as a result of bagging, arcing, or cross validation. The file format is flexible enough to easily accommodate further extensions and exotic tree-related objects created in other Salford Systems applications.

Once a grove file is created, it can be translated into SAS®-compatible, C, and PMML languages. Current and future API and OEM development efforts will focus on grove files and functions that query and manipulate models in grove files.

Exporting CART Model Information

CART 5 can export the model information contained within the binary grove file, including primary and surrogate splitting rules, as source code in several programming languages. The files containing the exported code can be used outside CART for scoring data. Export language formats currently supported are SAS® compatible, C, and PMML.

Missing Value Summary Report

This report identifies the proportion of records missing for the target and each predictor variable, and for each sample (learn/test), sorted from most- to least-missing.
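A minimal re-creation of such a report is straightforward; the variable names and data below are invented for illustration, with None marking a missing value.

```python
# Sketch of a missing value summary: proportion of records missing
# per variable, sorted from most- to least-missing. Illustrative data.
data = {
    "TARGET": [1, 0, 1, None, 0, 1],
    "AGE":    [34, None, None, 51, None, 40],
    "INCOME": [52000, 78000, 61000, 45000, 90000, 38000],
}

def missing_report(columns):
    report = []
    for name, values in columns.items():
        frac_missing = sum(v is None for v in values) / len(values)
        report.append((name, frac_missing))
    # Sort most-missing first, as in the CART 5 report.
    return sorted(report, key=lambda item: item[1], reverse=True)

report = missing_report(data)
```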

Entropy Splitting Rule

This well-known splitting rule is related to the likelihood function. With multilevel targets it tends to favor splits in which as many levels as possible are separated perfectly or nearly perfectly. As a result, Entropy puts more emphasis than either Gini or Twoing on getting rare levels right relative to common levels. In different circumstances, its properties may be similar to Gini or Twoing, or somewhere between them.

32-Character Variable Names

CART 5 supports variable names up to 32 characters.

Path Length Extended to Windows Maximum

CART 5 supports a Windows maximum path length of 256 characters (including the file name).

Improved Navigator Window

  • The tree topology navigator now allows you to display either the learn sample or the test sample.

  • Toggle the secondary navigator window panel to display terminal node counts or the relative cost curve with an emphasis on all trees within one standard deviation.

  • The new action button in the navigator allows the user to easily save the navigator and the grove file, score new data, or translate the tree into one of the available languages.

  • Easily compare learn and test samples at any level of the tree.

  • View the tree topology display with the focus on any specified auxiliary variable.

  • View auxiliary variable descriptive statistics or frequency tables at any level of the tree.

New and Improved Summary Reports

  • An improved terminal node report now enables you to evaluate the purity or homogeneity of the terminal nodes, an indication of how well CART has partitioned the classes.

  • A prediction success report allows you to specify a focus target class, enabling quicker analysis of the most important class.

  • A learn/test sample breakdown is available in a majority of the post-processing result windows and dialogs.

  • Result windows allow the user to toggle displays between the percent of data or the number of cases.

  • Result windows allow the user to choose among various graph forms.

Additional Tree Details for Viewing and Printing

An increased level of tree detail provides more information, allowing greater control when displaying and printing your trees.

  • Display weighted and/or unweighted case counts.

  • Specify level of decimal place precision.

  • Display a target class breakdown in a color-coded histogram.

  • Display a node-splitting variable name inside and/or outside the node display.

  • Specify and save the default level of detail for subsequent displays.

  • Specify detail for internal and terminal nodes separately.

  • Quickly change or set the level of magnification for clear and easy on-screen viewing.

  • Toggle the tree displays for "compact" or "expanded" views.

  • Display a class table with or without color coding.

Improved Model Setup

Quickly and easily specify the target, predictor, categorical, weighting, and auxiliary variables in a single setup tab.

Easy Data Access

We have continued the direct link to DBMS/Copy™, with easy access to over 90 different file formats, including more than ten new formats. For example, you can import from and export to statistical analysis packages (e.g., SAS®, SPSS) and spreadsheets (e.g., Excel, Lotus). We have also added native support for text datasets.

Printing

Automatic page fitting allows the user to easily print trees on two pages when possible. Upgraded support for large-format printing and plot printing allows the user to easily produce presentation-quality printouts of large trees on a single piece of paper.

Salford Systems Wins the KDDCup 2000 Data Mining/Web Mining Competition

30/03/2002

Using the latest versions of CART and MARS, Salford Systems was a winner in the KDDCup 2000 international data mining competition organized by the ACM (Association for Computing Machinery), the School of Electrical and Computer Engineering at Purdue University, and Blue Martini Software. Competing against a world-class field that included most of the best-known firms in data mining, the Salford Systems team came in first by combining a decade of data mining experience with CART and MARS.

The two-month-long competition involved intensive analysis of the clickstream and customer data of an e-commerce retail web site. Contestants were provided data from February and March 2000 and were asked to apply their models to the April clickstream. Models were designed to support web-site personalization and to improve the profitability of the site by increasing customer response. Salford Systems won in the category of identifying and characterizing the most valuable customers, using Salford Systems CART, MARS, HotSpotDetector and TreeNet software.

For more information about KDDCup 2000 see http://www.ecn.purdue.edu/KDDCUP