Process, Data and Classifier Models for Accessible Supervised

Friday, 28 May, 2010 - 09:45
Campus: Brussels Humanities, Sciences & Engineering campus
Faculty: Science and Bio-engineering Sciences
Steven De Bruyne
PhD defence

Supervised classification problems are at heart easy to understand, but
rapidly get very hard to actually solve. Many criteria can be used to determine
a classifier. The quality of the classifier can only be determined
afterwards by means of validation, in most cases using cross-validation and
10-fold cross-validation in particular. Several classification algorithms are
popular. Based on classification power alone, some, such as Support Vector Machines
and k-Nearest Neighbor, can be preferred. When a broader range of accessibility
criteria is taken into account, namely ease of use, transparency
and power, it can be concluded that these powerful classification
algorithms do not score very high overall. This can also explain why a
classification algorithm such as C4.5 is actually more popular: it is very
accessible, even if it is less powerful in the traditional sense.
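
As an illustration of the validation step mentioned above, the following is a minimal sketch of 10-fold cross-validation in plain Python. It is not taken from the thesis: the function names, the toy one-dimensional data and the use of a 1-nearest-neighbor rule as the classifier are all assumptions made for the example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nearest_neighbor_predict(train, x):
    """Return the label of the training point closest to x (1-NN, assumed classifier)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def cross_validate(data, k=10, seed=0):
    """Mean accuracy over k folds; every fold is held out exactly once."""
    folds = k_fold_indices(len(data), k, seed)
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        correct = sum(
            nearest_neighbor_predict(train, data[i][0]) == data[i][1]
            for i in held_out
        )
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k

# Toy data (hypothetical): values below 50 are class "a", the rest class "b".
data = [(float(v), "a" if v < 50 else "b") for v in range(100)]
folds = k_fold_indices(len(data), 10)
accuracy = cross_validate(data, k=10)
```

Each example is used for testing exactly once and for training in the other nine rounds, which is why the resulting estimate is only available after the classifier has been built.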

Many problems that are the source of low accessibility can be attributed
to the nature of the traditional data mining processes. These problems
can be roughly grouped into linearity shortcomings, modularity shortcomings
and inadequate user involvement shortcomings. A new process model
is introduced that solves the shortcomings of the traditional processes and
thus allows for more accessible classification problem solving. The feature
selection and construction is now done while building the classifier. This
immediately reveals the effects of these actions, making well founded decisions
possible at each stage. The process is iterative, so results and insights
can be put to use immediately in the following phases. By building
a data model at the beginning, the user no longer has to do difficult transformations.
The responsibility of being able to handle the data has been
moved to the algorithms. As the process cannot continue unless the user
can decide how to continue, a tool following the new process is forced to
incorporate the idea that information must be communicated to the user
in a form the user understands. Expert knowledge from the domain expert
can then be used in two ways. First, expert knowledge can be added as
meta data when building the data model. Second, the domain expert can
directly influence the construction of the classifier, which can be based on
expert knowledge. Tools following the new process should no longer need
algorithm and parameter selections. Such a tool can display many patterns
using multiple techniques, at least one of which should be acceptable. The
optimal selections can be left to the user when selecting the path to continue.
By communicating the patterns to the domain expert, the system
does not only help the domain expert to build a classifier, but also makes
the domain expert understand the data better. Doing this in an iterative
process makes it less overwhelming and also confronts the domain expert
with more hidden patterns. The end result is that now not only a classifier
is built, but also that expert knowledge has been created. Although the
new process model may not be suitable for every classification problem, for
example because of the scope or extreme properties of the problem, it does provide a
model for more accessible classification problem solving in other cases.

To support this classification process, the structured classification data
model is created. It defines a combination of data and a description of the
structure of the data including different kinds of meta data. The structure
enables the definition of attribute types, which are either numerical, nominal
or ordinal in nature. The most powerful addition is the possibility to
indicate whether attribute values are optional or not. It is even possible to
indicate that the existence of some attribute values is dependent on some
constraint. By allowing the domain expert to add this information, the
preconditions are set to move the responsibility of dealing with the inherent
structure of the data from the user to the classification tool. This addresses
the most difficult part of preprocessing, which severely limits accessibility.
Some shortcomings associated with the lack of user involvement
are also remedied this way. The Structured Data Meta Classification
Tree (SDMCTree) allows traditional classification algorithms to be used in
combination with the structured classification data. This way the traditional
classification process may also be followed, but the preprocessing is
still significantly reduced. Depending on the circumstances, three different
algorithms are available to build different variations of the SDMCTree.
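
To make the idea of structural meta data concrete, here is a small sketch of how attribute types, optional values and existence constraints could be declared and checked. It is only an assumed representation: the `Attribute` class, the `present_if` field, the `validate` function and the example attributes are hypothetical and do not come from the thesis.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

KINDS = {"numerical", "nominal", "ordinal"}

@dataclass
class Attribute:
    name: str
    kind: str                      # one of KINDS
    optional: bool = False         # the value may be absent
    # Hypothetical existence constraint: the attribute only applies
    # to records for which this predicate returns True.
    present_if: Optional[Callable[[Dict[str, Any]], bool]] = None

    def __post_init__(self):
        if self.kind not in KINDS:
            raise ValueError(f"unknown attribute kind: {self.kind}")

def validate(schema: List[Attribute], record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = []
    for attr in schema:
        applicable = attr.present_if is None or attr.present_if(record)
        has_value = record.get(attr.name) is not None
        if not applicable and has_value:
            problems.append(f"{attr.name}: value given but constraint not met")
        elif applicable and not has_value and not attr.optional:
            problems.append(f"{attr.name}: required value missing")
    return problems

# Hypothetical schema: cigarettes_per_day only exists for smokers.
schema = [
    Attribute("age", "numerical"),
    Attribute("smoker", "nominal"),
    Attribute("education", "ordinal", optional=True),
    Attribute("cigarettes_per_day", "numerical",
              present_if=lambda r: r.get("smoker") == "yes"),
]

ok = validate(schema, {"age": 41, "smoker": "no"})
bad = validate(schema, {"age": 41, "smoker": "yes"})
```

With such a description, a missing `cigarettes_per_day` for a non-smoker is not a data quality problem that the user has to repair beforehand; it is a structural fact that the tool can take into account.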

The Glass Tree classifier model complies with the aforementioned process
model and uses the structured classification data model. Contrary to
traditional classifiers, the Glass Tree comes in two forms. The Glass Tree
Creator is used to create the structure of the classifier. The Glass Tree Classifier
calibrates this structure with training data and can perform the actual
classifications. The Glass Tree finds patterns in the form of candidate splits
along attributes and candidate orientations for linear splits. This information
is then communicated to the user with the change in classification power
they bring, the information they make available, and a visualization of the
data simulating uncertainty that can be browsed through the dimensions.
This information allows the user to choose how to continue: by growing
the tree with a split on an attribute, a linear split, or a
user-guided linear split. The user can also decide to undo previous actions
by pruning the tree. The cycle can then restart. The result is a versatile
and powerful classifier that addresses the remaining accessibility shortcomings
by extensively involving the user, and thereby meets all requirements
of accessibility.
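
The step of proposing candidate splits along an attribute, together with the change in classification power each one brings, can be sketched roughly as follows. This is not the Glass Tree algorithm itself: the gain in training accuracy of a majority-vote rule is used here as an assumed stand-in for classification power, and the function names and toy data are hypothetical.

```python
from collections import Counter

def majority_accuracy(labels):
    """Fraction of labels equal to the most common label (0.0 if empty)."""
    if not labels:
        return 0.0
    return Counter(labels).most_common(1)[0][1] / len(labels)

def candidate_splits(values, labels):
    """Return (threshold, accuracy_gain) pairs for one numerical attribute, best first.

    Thresholds are the midpoints between consecutive distinct values.
    The gain compares majority-vote accuracy after the split with the
    accuracy of a single majority vote before it.
    """
    base = majority_accuracy(labels)
    distinct = sorted(set(values))
    n = len(labels)
    candidates = []
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        after = (majority_accuracy(left) * len(left)
                 + majority_accuracy(right) * len(right)) / n
        candidates.append((threshold, after - base))
    return sorted(candidates, key=lambda c: c[1], reverse=True)

# Toy data (hypothetical): the classes separate cleanly between 3.0 and 4.0.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = ["a", "a", "a", "b", "b", "b"]
ranked = candidate_splits(values, labels)
```

In an interactive tool, a ranked list like this would be shown to the user rather than applied automatically, so that the choice of which split to grow remains with the domain expert.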

Attachment: 201005281a.pdf