Random Forest in Data Analytics

Technology Roadmap Sections and Deliverables

Unique identifier:

  • 3RF-Random Forest

This is a “level 3” roadmap at the technology/capability level (see Fig. 8-5), where “level 1” would indicate a market level roadmap and “level 2” would indicate a product/service level technology roadmap.

Roadmap Overview

The basic high-level structure of machine learning algorithms is depicted in the figure below:

Workflow.jpg

Random Forest is a classification technique based on decision trees. Classification problems are a big part of machine learning because it is important to know which class or group each observation belongs to. Many classification algorithms are used in data analysis, such as logistic regression, support vector machines, the Bayes classifier, and decision trees. Random forest sits near the top of the classifier hierarchy.

Unlike the traditional decision tree classification technique, a random forest classifier grows many decision trees. In a traditional decision tree, an optimal split is found at each node to decide whether a property is true or false. A random forest contains many such trees operating as an ensemble, which allows users to make several binary splits in the data and establish complex classification rules. Each tree in the random forest outputs a class prediction, and the class with the most votes becomes the model’s prediction: majority wins. A large population of relatively uncorrelated models operating as a committee will outperform the individual constituent models. For a random forest to perform well: (1) there needs to be a real signal in the features used to build the model, so that it does better than random guessing, and (2) the predictions (and errors) made by the individual trees need to have low correlations with each other.


Two methods can be used to ensure that each individual tree is not too correlated with the behavior of any other trees in the model:

  • Bagging (Bootstrap Aggregation) - Each individual tree randomly samples from the dataset with replacement, resulting in different trees.
  • Feature Randomness - Instead of every tree being able to consider every possible feature and pick the one that produces the most separation between the observations in the left and right node, the trees in the random forest can only pick from a random subset of features. This forces more variation amongst the trees in the model, which results in lower correlation across trees and more diversification.

Trees are not only trained on different sets of data, but they also use different features to make decisions.
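
To make these two mechanisms concrete, here is a minimal toy sketch in Python (assuming scikit-learn and NumPy are available; names such as N_TREES and N_FEATURES are illustrative). Note that production implementations such as scikit-learn's RandomForestClassifier re-draw the random feature subset at every split, whereas this simplified version draws it once per tree.

<syntaxhighlight lang="python">
# Minimal sketch of bagging + feature randomness (simplified: per-tree,
# not per-split, feature subsets). Illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

N_TREES, N_FEATURES = 25, 5          # forest size and per-tree feature subset
trees, feature_subsets = [], []

for _ in range(N_TREES):
    # Bagging: sample the training rows with replacement.
    rows = rng.integers(0, len(X), size=len(X))
    # Feature randomness: each tree sees only a random subset of columns.
    cols = rng.choice(X.shape[1], size=N_FEATURES, replace=False)
    trees.append(DecisionTreeClassifier().fit(X[rows][:, cols], y[rows]))
    feature_subsets.append(cols)

# Majority vote: each tree predicts a class; the most common class wins.
votes = np.array([t.predict(X[:, c]) for t, c in zip(trees, feature_subsets)])
majority = (votes.mean(axis=0) > 0.5).astype(int)   # binary classes 0/1
print("ensemble training accuracy:", (majority == y).mean())
</syntaxhighlight>

In practice one would simply call scikit-learn's RandomForestClassifier(n_estimators=25, max_features=5), which performs both randomization steps internally.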


The random forest machine learning algorithm is a disruptive technological innovation (a technology that significantly shifts competition to a new regime by providing a large improvement on a different FOM than the mainstream product or service) because it is:

  • versatile - can do regression or classification tasks; can handle binary, categorical, and numerical features. Data requires little pre-processing (no rescaling or transforming)
  • parallelizable - process can be split to multiple machines, which results in significantly faster computation time
  • effective for high dimensional data - algorithm breaks down large datasets into smaller subsets
  • fast - each tree only uses a subset of features, so the model can use hundreds of features. It is quick to train
  • robust - bins outliers and is indifferent to non-linear features
  • low bias, moderate variance - each decision tree has high variance but low bias; because the trees are averaged, the model has low bias and moderate variance

As a result, compared with the traditional way of having one operational model, random forest has extraordinary performance advantages on FOMs such as accuracy and efficiency because of the large number of models it can generate in a relatively short period of time. One interesting case of random forest increasing accuracy and efficiency is in the field of investment. In the past, investment decisions relied on specific models built by analysts, and there were inevitably blind spots due to the reliance on single models. Nowadays, random forest (and machine learning at a larger scale) enables many models to be generated quickly, creating a more robust decision-making solution that builds a “forest” of highly diverse "trees". This has significantly changed the industry in terms of efficiency and accuracy. For example, in a research paper by Luckyson Khaidem et al. (2016), the ROC curve shows great accuracy when modeling Apple's stock performance using random forest. The paper also showed the continuous improvement of accuracy when applying machine learning. As a result, random forest/machine learning has significantly changed the way the investment sector operates.

Design Structure Matrix (DSM) Allocation

RE-DSM.png

The 3-RF tree that we can extract from the DSM above shows that Random Forest (3RF) is part of a larger data analysis service initiative on Machine Learning (ML), which is in turn part of a major marketing initiative (here we use housing price prediction in the real estate industry as an example). Random Forest requires the following key enabling technologies at the subsystem level: Bagging (4BAG), Stacking (4STK), and Boosting (4BST). These are among the most common ensemble approaches, and they form the technologies and resources at level 4.

Roadmap Model using OPM

We provide an Object-Process Diagram (OPD) of the 3RF roadmap in the figure below. This diagram captures the main object of the roadmap (Random Forest), its various instances with different focuses, its decomposition into subsystems (data, decision trees, votes), its characterization by Figures of Merit (FOMs), and the main processes (defining, predicting).

OPD.jpg

An Object-Process-Language (OPL) description of the roadmap scope is auto-generated and given below. It reflects the same content as the previous figure, but in a formal natural language.

OPL.jpg

Figures of Merit

The table below shows a list of FOMs by which random forest models can be assessed. The first three are used to assess the accuracy of the model. Among these, the Root Mean Squared Error (RMSE) is the FOM most commonly used to measure the performance of predictive models. Since the errors are squared before they are averaged, the RMSE gives relatively high weight to large errors in comparison with other performance measures such as R2 and MAE. To measure predictive performance, the RMSE should be calculated on out-of-sample data that was not used for model training. Mathematically, RMSE can vary between zero and positive infinity. A very high RMSE indicates that the model is very poor at out-of-sample prediction. While an RMSE of zero is theoretically possible, it would indicate perfect prediction and is extremely unlikely in real-world situations.
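
For reference, the standard definitions of RMSE and MAE over n out-of-sample predictions are given below; the squaring inside the RMSE is what gives large errors their extra weight:

<math>\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2},\qquad \mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|</math>

where <math>y_i</math> is the observed value and <math>\hat{y}_i</math> the model's prediction.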

FOM Table.jpg

Due to the nature of this technology, it has been challenging to quantify the growth of the FOMs over time, because performance depends not only on the technology itself (the algorithm) but also on the dataset and on the parameters selected for modeling, such as the number of trees. We used the same dataset to run three models with optimized parameters, and the chart below shows RMSE vs. year.

FOM Chart.jpg

Random Forest is still being rapidly developed, with tremendous effort from many research groups around the world to improve its performance. Meanwhile, there is also substantial effort in the implementation and application space, in industries such as finance, investment, services, high tech, and energy.

Alignment with Company Strategic Drivers

In this section and the following sections, we embed data analytics in the real estate industry, assuming a hypothetical company that uses random forest to predict housing prices (as Zillow and similar companies do). A dataset is used for the quantification discussion, which includes:

  • 2821 observations (houses' information)

  • 73 independent variables, such as lot area, garage size, etc.

  • 1 dependent variable: sale price
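
As a concrete illustration, here is a minimal sketch of how such a model could be trained and scored on held-out data (assuming Python with pandas and scikit-learn; the file name housing.csv and the column name SalePrice are placeholders, and the features are assumed to be already numeric or encoded):

<syntaxhighlight lang="python">
# Minimal train/evaluate sketch; file and column names are placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("housing.csv")               # 2821 rows, 73 features + SalePrice
X, y = df.drop(columns=["SalePrice"]), df["SalePrice"]

# Hold out data the model never sees, so RMSE/R2 are out-of-sample.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)                  # out-of-sample predictions
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R2  :", r2_score(y_test, pred))
</syntaxhighlight>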


The table below shows the strategic drivers of our hypothetical company and the alignment of the 3RF technology roadmap with them.

RF-Driver.jpg

The first strategic driver indicates the company aims to improve the accuracy of the model, and the 3RF roadmap is likewise optimizing the algorithm and parameters to fine-tune the model; this driver is therefore aligned with the 3RF roadmap. The second driver indicates the company is looking for simplicity, and this is not aligned with the current 3RF roadmap. This discrepancy needs immediate attention and must be addressed.

Positioning of Company vs. Competition

The figure below shows a summary of other real estate companies that also use data analytics to predict housing prices.

RF-Comany.jpg


The table below shows the comparison between different companies, including the models' set-up (a Python mapping of these parameters is sketched after the table):

- ntree: number of trees (models) in the forest
- mtry: number of variables examined at each split of the trees
- nodesize: minimum number of observations in each terminal node of the trees

RF-comp-small.png
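
The parameter names above (ntree, mtry, nodesize) follow R's randomForest package. For readers working in Python, they map roughly onto scikit-learn's RandomForestRegressor arguments as follows (the values shown are illustrative, not taken from the comparison table):

<syntaxhighlight lang="python">
# Rough mapping of R randomForest arguments onto scikit-learn equivalents.
# Values are illustrative only.
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=500,      # ntree: number of trees in the forest
    max_features=24,       # mtry: variables examined at each split
    min_samples_leaf=5,    # nodesize: minimum observations per terminal node
)
</syntaxhighlight>

With max_features=None (every variable considered at each split) the model degenerates toward plain bagged trees; smaller mtry values increase tree diversity, which is the feature-randomness idea described earlier.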

Zillow-like major companies, which heavily utilize data analytics in their price prediction models, tend to have very high R2. To achieve high accuracy, such a company includes as much variable information and as many observations as possible, which helps "train" the model to be very smart. In doing so, however, it sacrifices simplicity, with many variables and a large number of models built for machine learning purposes. By contrast, some local "mom and pop" shops use very simple models such as linear regression or CART, in which the accuracy of the model is sacrificed. Our hypothetical company is trying to use random forest to deliver good accuracy while still maintaining a certain level of simplicity.


(In this chart, simplicity is defined as the reciprocal of the product of the number of variables and the number of models. For example, our hypothetical company uses 73 variables and 5 models for prediction, so its simplicity is 1/(73*5) = 0.0027. The more variables or models the prediction relies on, the lower the simplicity, which aligns with intuition.)


RM Competition.jpg

The Pareto Front (for this specific dataset) is shown as the blue line. Local company A (using simple linear regression) is clearly not on the Pareto Front; instead, it is dominated. Local company B, using a CART model, is just as simple but more accurate. The Zillow-like major company and our hypothetical company are both much better in terms of accuracy.

For the Yr. 2020 target, the green line defines the expected Pareto Front at that time. Compared with current performance, the future Pareto Front is expected to improve accuracy (R2), simplicity, or both. The company's strategy will determine the direction and extent of the improvement.

Technical Model

An example morphological matrix is developed for the 3RF roadmap in this specific case. The purpose of such a model is to understand all the design vectors, explore the design tradespace, and establish what the active constraints in the system are.

RF-Morph Matrix.jpg

Due to the special nature of this technology, there are no defined equations for the FOMs. The FOM values are impacted by the model parameters ntree, nodesize, and mtry (without clearly defined mathematical or regression equations) and by the dataset, including all variables and observations. Explanation of parameters:

- ntree: number of CART trees in the forest
- mtry: number of variables examined at each split of the trees
- nodesize: minimum number of observations in each terminal node of the trees

As a result, for any given dataset, the two FOMs can be expressed as:

R2 ~ f1(ntree, nodesize, mtry)

RMSE ~ f2(ntree, nodesize, mtry)

One dataset is used to normalize any potential difference caused by the data source. This dataset is from the real estate industry, with 2821 observations (houses' information), 73 independent variables such as lot area and garage size, and 1 dependent variable (sale price). The goal of the model is to predict the sale price based on these 73 independent variables.

Notional trends: each dot on the plots below represents the output of the corresponding FOM (R2 or RMSE) at the given value of the parameter, while the other parameters are held fixed.
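
Sweeps of this kind can be scripted by varying one parameter at a time; here is a minimal sketch for mtry, reusing the train/test split from the earlier loading sketch (the baseline values for the other parameters and the candidate mtry values are illustrative):

<syntaxhighlight lang="python">
# One-at-a-time sensitivity sweep over mtry (max_features), holding
# ntree and nodesize fixed at illustrative baseline values.
# Assumes X_train, y_train, X_test, y_test exist from the earlier sketch.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

def foms(max_features):
    m = RandomForestRegressor(n_estimators=500, max_features=max_features,
                              min_samples_leaf=5, random_state=0)
    pred = m.fit(X_train, y_train).predict(X_test)
    return r2_score(y_test, pred), np.sqrt(mean_squared_error(y_test, pred))

for mtry in (5, 10, 25, 40, 60, 73):          # candidate mtry values
    r2, rmse = foms(mtry)
    print(f"mtry={mtry:3d}  R2={r2:.3f}  RMSE={rmse:,.0f}")
</syntaxhighlight>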

These two charts below show the impact of mtry on the 2 FOMs:

RF-mtry-sens-1.png

RF-mtry-sens-3.png

These two charts below show the impact of ntree on the 2 FOMs:

RF-ntree-sens.png

These two charts below show the impact of node size on the 2 FOMs:

RF-nodesize-sens.jpg


Because there is no governing equation, no partial derivatives can be drawn to illustrate the sensitivities. From the notional trends above, it is also clear that the finite differences change, even in the direction (positive or negative) of the difference. As a result, a baseline model with a set of inputs was defined; by varying the values of the inputs in different models, tornado charts were generated to represent the finite difference from the baseline caused by these changes. Under all the given conditions, both ntree and mtry are very critical to performance, and both FOMs are most sensitive to mtry.

Tor-R2.png

Tor-RMSE.png
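
With no closed-form equations, the finite differences behind these tornado charts can be estimated by perturbing one parameter at a time around the baseline; a minimal sketch follows (again reusing the earlier train/test split; the baseline and low/high values are illustrative):

<syntaxhighlight lang="python">
# Finite differences around a baseline for a tornado chart: evaluate the
# FOM at a low and a high value of each parameter, holding the others at
# their baseline settings. Assumes X_train etc. from the earlier sketch.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

BASELINE = {"n_estimators": 500, "max_features": 24, "min_samples_leaf": 5}
RANGES = {"n_estimators": (100, 1000),   # ntree
          "max_features": (5, 60),       # mtry
          "min_samples_leaf": (1, 25)}   # nodesize

def r2_at(params):
    m = RandomForestRegressor(random_state=0, **params)
    return r2_score(y_test, m.fit(X_train, y_train).predict(X_test))

base = r2_at(BASELINE)
for name, (lo, hi) in RANGES.items():
    swing = [r2_at({**BASELINE, name: v}) - base for v in (lo, hi)]
    print(f"{name:16s}  low: {swing[0]:+.3f}   high: {swing[1]:+.3f}")
</syntaxhighlight>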

Financial Model

The figure below contains a sample NPV analysis underlying the 3RF roadmap. It shows the initial non-recurring cost of project development over the first 2 years for the research projects, before revenue starts being generated in Yr. 3. The ramp-up period is around 6 years, until the market is saturated with revenue of around $8MM/year. Total estimated project life is 17 years.

RF NPC.jpg
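
As a rough illustration of the arithmetic behind such an NPV analysis, here is a stylized sketch. Only the 2-year development phase, revenue starting in Yr. 3, the roughly 6-year ramp to $8MM/year, and the 17-year life come from the roadmap above; the discount rate and the development cost magnitude are assumptions for illustration.

<syntaxhighlight lang="python">
# Stylized NPV sketch: 2 years of non-recurring development cost,
# revenue starting in year 3 and ramping to ~$8MM/yr, 17-year life.
DISCOUNT_RATE = 0.10      # assumed; not stated in the roadmap
DEV_COST = -2.0           # $MM per year in years 1-2 (assumed magnitude)

cash_flows = []
for year in range(1, 18):                     # 17-year project life
    if year <= 2:
        cash_flows.append(DEV_COST)           # R&D investment phase
    else:
        ramp = min((year - 2) / 6, 1.0)       # ~6-year ramp-up
        cash_flows.append(8.0 * ramp)         # toward $8MM/year revenue

npv = sum(cf / (1 + DISCOUNT_RATE) ** t
          for t, cf in enumerate(cash_flows, start=1))
print(f"NPV = ${npv:.1f}MM")
</syntaxhighlight>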

List of R&T Projects and Prototypes

Using Random Forests on Real-World City Data for Urban Planning in a Visual Semantic Decision Support System: A preeminent problem affecting urban planning is the appropriate choice of location to host a particular activity (either commercial or common welfare service) or the correct use of an existing building or empty space. The proposed system in this research paper combines, fuses, and merges various types of data from different sources, encodes them using a novel semantic model that can capture and utilize both low-level geometric information and higher level semantic information and subsequently feeds them to the random forests classifier, as well as other supervised machine learning models for comparisons. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6567884/


Random Forest ensembles for detection and prediction of Alzheimer's disease with a good between-cohort robustness: Computer-aided diagnosis of Alzheimer's disease (AD) is a rapidly developing field of neuroimaging with strong potential to be used in practice. In this context, assessment of models' robustness to noise and imaging protocol differences together with post-processing and tuning strategies are key tasks to be addressed in order to move towards successful clinical applications. In this study, we investigated the efficacy of Random Forest classifiers trained using different structural MRI measures, with and without neuroanatomical constraints in the detection and prediction of AD in terms of accuracy and between-cohort robustness. https://www-sciencedirect-com.ezproxyberklee.flo.org/science/article/pii/S2213158214001326


A Random Forest approach to predict the spatial distribution of sediment pollution in an estuarine system: Modeling the magnitude and distribution of sediment-bound pollutants in estuaries is often limited by incomplete knowledge of the site and inadequate sample density. To address these modeling limitations, a decision-support tool framework was conceived that predicts sediment contamination from the sub-estuary to broader estuary extent. For this study, a Random Forest (RF) model was implemented to predict the distribution of a model contaminant. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0179473


Towards Automatic Personality Prediction Using Facebook Like Categories: Effortlessly accessible digital records of behavior such as Facebook Likes can be obtained and utilized to automatically distinguish a wide range of highly delicate personal traits, including life satisfaction, cultural ethnicity, political views, age, gender, and personality traits. The analysis presented is based on a dataset of over 738,000 users who conferred their Facebook Likes, social network activities, egocentric network, demographic characteristics, and the results of various psychometric tests for our extended personality analysis. The proposed model uses a unique mapping between each Facebook Like object and the corresponding Facebook page category/sub-category object, which is then evaluated as features for a set of machine learning algorithms to predict individual psycho-demographic profiles from Likes. The model properly distinguishes between a religious and a nonreligious individual in 83% of circumstances, between Asian and European in 87% of situations, and between emotionally stable and emotionally unstable in 81% of situations. https://arxiv.org/ftp/arxiv/papers/1812/1812.04346.pdf


CPEM: Accurate cancer type classification based on somatic alterations using an ensemble of a random forest and a deep neural network: With recent advances in DNA sequencing technologies, fast acquisition of large-scale genomic data has become commonplace. For cancer studies, in particular, there is an increasing need for the classification of cancer type based on somatic alterations detected from sequencing analyses. However, the ever-increasing size and complexity of the data make the classification task extremely challenging. In this study, we evaluate the contributions of various input features, such as mutation profiles, mutation rates, mutation spectra and signatures, and somatic copy number alterations that can be derived from genomic data, and further utilize them for accurate cancer type classification. We introduce a novel ensemble of machine learning classifiers, called CPEM (Cancer Predictor using an Ensemble Model), which is tested on 7,002 samples representing over 31 different cancer types collected from The Cancer Genome Atlas (TCGA) database. We first systematically examined the impact of the input features. Features known to be associated with specific cancers had relatively high importance in our initial prediction model. We further investigated various machine learning classifiers and feature selection methods to derive the ensemble-based cancer type prediction model achieving up to 84% classification accuracy in the nested 10-fold cross-validation. Finally, we narrowed down the target cancers to the six most common types and achieved up to 94% accuracy. https://www.nature.com/articles/s41598-019-53034-3


Facebook Engineering: Under the hood - Suicide prevention tools powered by AI: Suicide is the second most common cause of death for people ages 15-29. Research has found that one of the best ways to prevent suicide is for those in distress to hear from people who care about them. Facebook is well positioned — through friendships on the site — to help connect a person in distress with people who can support them. It’s part of our ongoing effort to help build a safe community on and off Facebook. We recently announced an expansion of our existing suicide prevention tools that use artificial intelligence to identify posts with language expressing suicidal thoughts. We’d like to share more details about this, as we know that there is growing interest in Facebook’s use of AI and in the nuances associated with working in such a sensitive space. https://engineering.fb.com/ml-applications/under-the-hood-suicide-prevention-tools-powered-by-ai/

Key Publications, Presentations and Patents

Links to Key Patents on Random Forest:

http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&p=1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&d=PALL&s1=6009199.PN.

http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&p=1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&d=PG01&s1=20120321174.PGNR.

Key Publications on the Development of Random Forest:

https://web.archive.org/web/20160417030218/http://ect.bell-labs.com/who/tkh/publications/papers/odt.pdf

http://www.cis.jhu.edu/publications/papers_in_database/GEMAN/shape.pdf

https://ieeexplore.ieee.org/document/709601

https://link.springer.com/content/pdf/10.1023/A:1007607513941.pdf

https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf

Key Reference on Data Analytics in Real Estate Industry:

https://pdfs.semanticscholar.org/782d/3fdf15f5ff99d5fb6acafb61ed8e1c60fab8.pdf

https://towardsdatascience.com/home-value-prediction-2de1c293853c

https://medium.com/@santhoshetty/predicting-housing-prices-4969f6b0945

http://terpconnect.umd.edu/~lzhong/INST737/milestone2_presentation.pdf

https://nycdatascience.com/blog/student-works/housing-price-prediction-using-advanced-regression-analysis/

Technology Strategy Statement

Our target is to develop a real estate digital platform that integrates the traditional real estate industry with cutting-edge data analytics. To achieve the target of accurately predicting the optimal housing price for a broader set of customers, we will invest in two R&D projects. The first is an algorithm optimization project to improve the accuracy of the current machine learning model to an R2 of 0.97 on the sample database. The second is a user interface improvement project to make our methodology simpler for users to adopt. These will enable us to reach our goal in Yr. 2023.