Classifying as Benign or Malignant with Decision Trees

Sources:
- Breast Cancer dataset - UCI Machine Learning Repository
- "Induction of Decision Trees" - research paper by J.R. Quinlan
- Medium article for modeling decision trees
- Hands-On Machine Learning book for decision trees and random forests

Here are some of the pros and cons of decision trees:
- Decision trees are logarithmic in cost, meaning they are not very computationally intensive, especially in comparison with other models on high-feature datasets.
- White box, meaning that we can actually understand how the model works.
- Prone to overfitting; we could prune the tree to see if that helps.
- Slightly unstable; there might be better trees that represent the true population better.
- Requires a balanced dataset without "inadequate" attributes, meaning we can't have several examples where the same set of attributes is indicative of different classes.

This is a dataset with 10 features derived from images of breast masses; they describe the characteristics of the cell nuclei. Each object in the dataset has a class value of 2 or 4: 2 for benign and 4 for malignant. The goal is to correctly classify an object based on its attributes.

We create a class using BaseEstimator to include in a scikit-learn pipeline. It deals with certain values in the data, primarily making certain values numeric and handling the nulls. The class does not need any initialization code, and because it is a transformer, we don't need to add any code in the fit function either. We also add a MinMaxScaler to the pipeline so it can be reused for other models we might build, and we deal with the diagnosis column (which currently has values of 2 or 4). The result is that malignant masses have a value of 1, and benign masses a value of 0. (A sketch of this transformer appears at the end of the post.)

Below is a pair plot that shows the relationships between the size attributes. I thought this was important because, from my personal knowledge, size is a common indicator of malignant tumors. Below that is the distribution of (some!) of the features, split by benign and malignant. As you can see, malignant tumors are generally associated with larger values for the given attributes.

Another thing I checked was the "adequacy" of the data. Based on the research paper "Induction of Decision Trees" by J.R. Quinlan, it is ideal for decision trees to be used on datasets where the same set of attributes of an object always results in the same classification. We check this in the code, finding that though there are some duplicates (where the attributes are all the same), there are no instances where the attributes are the same but the classification is different.

The model we use is a classification decision tree from scikit-learn, which uses the CART algorithm. This means the trees are binary: each node has at most two children. If we built this with an ID3-based model, we might see different results.

Running it initially, we find a tree depth of 9 and an accuracy of 94.16%. We improve on this by customizing some of the default hyper-parameters in scikit-learn. First we increase the values of min_samples_leaf and min_samples_split, which raises the accuracy to 95.62%. Doing this is a form of regularization, which reduces overfitting of the model when running it against the test set. In general, increasing the min_* hyper-parameters and decreasing the max_* hyper-parameters will regularize the model; they simply set requirements for when a node may split and when a leaf may be created.
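To make the steps above concrete, here are a few illustrative sketches. First, the cleaning transformer and pipeline. This is a minimal sketch, not the post's actual code: the raw data is assumed to be loaded into a DataFrame named df, the messy column name ("bare_nuclei"), the label column name ("class"), and the median imputation are all assumptions, since the post only says it coerces certain values to numeric, handles nulls, scales the features, and remaps the 2/4 labels to 0/1.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

class CleanUpTransformer(BaseEstimator, TransformerMixin):
    # No __init__ needed: there are no hyper-parameters to store.

    def fit(self, X, y=None):
        # A stateless transformer learns nothing, so fit is a no-op.
        return self

    def transform(self, X):
        X = X.copy()
        # Coerce non-numeric entries (e.g. "?") to NaN, then impute.
        # Column name and median imputation are assumptions.
        X["bare_nuclei"] = pd.to_numeric(X["bare_nuclei"], errors="coerce")
        X["bare_nuclei"] = X["bare_nuclei"].fillna(X["bare_nuclei"].median())
        return X

pipeline = Pipeline([
    ("clean", CleanUpTransformer()),
    ("scale", MinMaxScaler()),  # reusable for scale-sensitive models later
])

# The post folds the label handling into the same preprocessing step; here
# it is done directly on the target for clarity: malignant (4) -> 1,
# benign (2) -> 0.
X = pipeline.fit_transform(df.drop(columns="class"))
y = (df["class"] == 4).astype(int)
```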
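The pair plot can be reproduced with seaborn. The vars column names below are hypothetical stand-ins for whichever size attributes the post actually plots, and df is the cleaned DataFrame from the sketch above.

```python
import seaborn as sns

# hue="class" colors points by benign vs malignant, so the relationship
# between size and diagnosis is visible at a glance.
sns.pairplot(df, vars=["cell_size_uniformity", "cell_shape_uniformity"],
             hue="class")
```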
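The "adequacy" check the post describes comes down to a couple of pandas operations. This sketch again assumes the cleaned DataFrame df with a "class" label column.

```python
# Rows whose attribute values repeat elsewhere in the data.
feature_cols = [c for c in df.columns if c != "class"]
duplicated = df.duplicated(subset=feature_cols, keep=False)
print("rows with duplicated attributes:", duplicated.sum())

# Within each group of identical attribute rows, count distinct classes.
# Adequacy in Quinlan's sense holds when no group has more than one class.
classes_per_group = df[duplicated].groupby(feature_cols)["class"].nunique()
print("contradictory groups:", (classes_per_group > 1).sum())  # expect 0
```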
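Finally, fitting the CART tree and then regularizing it might look like the sketch below. The 80/20 split, the random_state, and the exact min_samples_* values are assumptions chosen only to illustrate the knobs being tuned; the post reports a depth of 9 at 94.16% accuracy before tuning and 95.62% after.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

baseline = DecisionTreeClassifier(random_state=42)  # CART under the hood
baseline.fit(X_train, y_train)
print("depth:", baseline.get_depth())
print("test accuracy:", baseline.score(X_test, y_test))

# Raising the min_* thresholds regularizes the tree: nodes need more
# samples before they may split or become leaves.
regularized = DecisionTreeClassifier(
    min_samples_split=10,  # assumed value
    min_samples_leaf=5,    # assumed value
    random_state=42,
)
regularized.fit(X_train, y_train)
print("regularized test accuracy:", regularized.score(X_test, y_test))
```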