Key point 1: Basics of Multiple Regression and Underlying Assumptions
- Describe the types of investment problems addressed by multiple linear regression and the regression process.
- Formulate a multiple linear regression model, describe the relation between the dependent variable and several independent variables, and interpret estimated regression coefficients.
- Explain the assumptions underlying a multiple linear regression model and interpret residual plots indicating potential violations of these assumptions.
Types of Regression Models:
- Logistic and linear models have different kinds of outcome: a logistic model is for a binary outcome (e.g., yes or no, true or false), and a linear model is for a continuous outcome (e.g., investment level, unemployment rate).
- Formulas of the linear model and the logistic model:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon$$
$$ \small P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p)}}$$
where $\small P(Y = 1 \mid X)$ is the probability that the dependent variable $\small Y$ is 1 given the vector of predictors $\small X$.
Why does the logistic model not look like a linear model? Rewriting the formula in terms of log odds shows the linear structure:
Original equation:
\[\small
P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p)}}
\]
Transforming into the log odds form (Log odds):
1. Let \( \small P = P(Y = 1 \mid X) \), then the log odds Logit($\small P$) is expressed as:
\[\small
\text{Logit}(P) = \ln\left(\frac{P}{1-P}\right)
\]
2. By substituting $\small P$ from the original equation, the log odds form can be simplified as:
\[\small
\text{Logit}(P) = \ln\left(\frac{\frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p)}}}{1 - \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p)}}}\right)
\]
3. Further simplifying:
\[\small
\text{Logit}(P) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p
\]
Therefore, the final log odds regression form is:
\[\small
\ln\left(\frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p
\]
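The simplification from the substituted expression to the linear form can be spelled out as follows. Here $\small z$ is shorthand introduced only for this sketch; it stands for the linear predictor $\small \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p$, so that $\small P = \frac{1}{1 + e^{-z}}$:
\[\small
1 - P = \frac{e^{-z}}{1 + e^{-z}}
\quad\Rightarrow\quad
\frac{P}{1 - P} = \frac{1}{e^{-z}} = e^{z}
\quad\Rightarrow\quad
\ln\left(\frac{P}{1 - P}\right) = z
\]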
Interpret coefficients:
- Each coefficient of a logistic model gives the change in the log odds that the dependent variable equals 1 for a 1-unit change in the corresponding independent variable, holding the other independent variables constant.
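A minimal sketch of this interpretation, assuming made-up coefficients (the values of `b0` and `b` below are hypothetical and chosen only for illustration). Raising one independent variable by 1 unit moves the log odds by its coefficient, which multiplies the odds by the exponential of that coefficient:

```python
import math


def logistic_probability(intercept, coefs, xs):
    """P(Y = 1 | X) for a logistic model with the given coefficients."""
    z = intercept + sum(b * x for b, x in zip(coefs, xs))
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """Logit of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, not taken from any estimated model.
b0, b = -1.0, [0.5, 0.25]

p_before = logistic_probability(b0, b, [2.0, 4.0])
p_after = logistic_probability(b0, b, [3.0, 4.0])  # X1 raised by 1 unit

# The log odds change by beta_1 (0.5 here) ...
change_in_log_odds = log_odds(p_after) - log_odds(p_before)
# ... so the odds are multiplied by exp(beta_1).
odds_ratio = (p_after / (1.0 - p_after)) / (p_before / (1.0 - p_before))
```

Note that the change in the probability itself is not constant; it depends on the starting values of the independent variables.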
Assumptions and residual plots:
- Linearity: The relationship between the dependent variable and the independent variables is linear.
- Homoskedasticity: The variance of the regression residuals is the same for all observations.
- Independence of errors: The observations are independent of one another. This implies the regression residuals are uncorrelated across observations.
- Normality: The regression residuals are normally distributed.
- Independence of independent variables:
a. Independent variables are not random.
b. There is no exact linear relation between two or more of the independent variables or combinations of the independent variables.
- The first plot compares the variables with one another, which checks the independence of the independent variables.
- The second plot compares the residuals with the predicted values of the dependent variable, which checks homoskedasticity.
- The third plot is a Q-Q plot, which visualizes the distribution of a variable by comparing it with a normal distribution; here it checks the normality of the residuals.
Key point 2: Evaluating Regression Model Fit and Interpreting Model Results
- Evaluate how well a multiple regression model explains the dependent variable by analyzing ANOVA table results and measures of goodness of fit
- Formulate hypotheses on the significance of two or more coefficients in a multiple regression model and interpret the results of the joint hypothesis tests
- Calculate and interpret a predicted value for the dependent variable, given the estimated regression model and assumed values for the independent variable
Fitness of model
- AIC: AIC is preferred if the purpose is prediction
- BIC: BIC is preferred if goodness of fit is the goal, and lower values of both measures are better.
- R-square
F-statistics measurement
- Restricted model: the baseline model, in which the coefficients of the omitted independent variables are restricted to zero.
- Unrestricted model: the baseline model plus the additional independent variables.
- Null hypothesis of the joint F-test: the coefficients of all additional independent variables equal zero (when the restricted model contains only the intercept, this is the test that all slope coefficients equal zero).
$F = \frac{\left(\text{Sum of squares error restricted model} - \text{Sum of squares error unrestricted model}\right) / q}{\text{Sum of squares error unrestricted model} / (n - k - 1)}$
where q is the number of additional independent variables (the additional degrees of freedom), and (n - k - 1) is the degrees of freedom (df) of the unrestricted model, with n observations and k independent variables. Equivalently:
$F = \frac{\left( \text{SSE}_{\text{restricted}} - \text{SSE}_{\text{unrestricted}} \right) / (\text{df}_{\text{restricted}} - \text{df}_{\text{unrestricted}})}{\text{SSE}_{\text{unrestricted}} / \text{df}_{\text{unrestricted}}}$
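The F-statistic formula can be sketched as a small function. The function name and the input figures are hypothetical; only the arithmetic follows the formula in the notes:

```python
def joint_f_statistic(sse_restricted, sse_unrestricted, q, n, k):
    """F-statistic for a joint test of q restrictions.

    n is the number of observations and k is the number of independent
    variables in the unrestricted model.
    """
    if q <= 0 or n - k - 1 <= 0:
        raise ValueError("q and n - k - 1 must both be positive")
    numerator = (sse_restricted - sse_unrestricted) / q
    denominator = sse_unrestricted / (n - k - 1)
    return numerator / denominator


# Hypothetical inputs: 45 observations, 4 variables in the unrestricted
# model, 2 of which are absent from the restricted model.
f_stat = joint_f_statistic(
    sse_restricted=120.0, sse_unrestricted=100.0, q=2, n=45, k=4
)
# (20 / 2) / (100 / 40) = 4.0
```

The result would then be compared with the critical F-value for q and n - k - 1 degrees of freedom.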
Calculate and interpret a predicted value
Key point 3: Model Misspecification
- Describe how model misspecification affects the results of a regression analysis and how to avoid common forms of misspecification
- Explain the types of heteroskedasticity and how it affects statistical inference
- Explain serial correlation and how it affects statistical inference
- Explain multicollinearity and how it affects regression analysis
Heteroskedasticity
- Unconditional heteroskedasticity occurs when the error variance is not correlated with the regression’s independent variables; it creates no major problems for statistical inference.
- Conditional heteroskedasticity occurs when the error variance is correlated with (conditional on) the values of the independent variables; it is more problematic for statistical inference.
- Testing for conditional heteroskedasticity: Breusch–Pagan (BP) test
- BP statistic = $n \times R^2$, where $n$ is the number of observations and $R^2$ comes from regressing the squared residuals on the independent variables
- Null hypothesis: no conditional heteroskedasticity
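A minimal sketch of the BP statistic, assuming a regression with a single independent variable so that it needs only the standard library. The helper names and the data are hypothetical; a real analysis would normally use a statistics package:

```python
def simple_ols(x, y):
    """Intercept, slope, and R-squared of a one-variable OLS regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sst = sum((yi - mean_y) ** 2 for yi in y)
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, 1.0 - sse / sst


def breusch_pagan_stat(x, y):
    """n * R^2 from regressing the squared residuals on x."""
    b0, b1, _ = simple_ols(x, y)
    squared_residuals = [(yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)]
    _, _, aux_r2 = simple_ols(x, squared_residuals)
    return len(x) * aux_r2


# Hypothetical data whose noise grows with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
noise = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8]
y = [2.0 + 3.0 * xi + e for xi, e in zip(x, noise)]

bp_stat = breusch_pagan_stat(x, y)
# Compare bp_stat with a chi-square critical value whose degrees of
# freedom equal the number of independent variables (1 here).
```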
Serial correlation (or autocorrelation)
- Standard error of autocorrelation = square root of 1/T where T = number of observations
- B-G (Breusch–Godfrey) test
- Null hypothesis: No Autocorrelation
- Impact of autocorrelation on coefficients:
- 1. If an independent variable is a lagged value of the dependent variable, the regression coefficient estimates are invalid and the coefficients’ standard errors are deflated, so t-statistics (and the F-statistic) are inflated.
- 2. If no independent variable is a lagged value of the dependent variable, the coefficient estimates are valid, but the F-statistic and t-statistics are still inflated.
- D-W (Durbin–Watson) test
- Null hypothesis: No Autocorrelation
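- Critical values: a DW statistic between 0 and 2 indicates positive serial correlation (the stronger the correlation, the lower the DW statistic)
- around 2: no serial correlation
- between 2 and 4: negative serial correlation (the stronger the correlation, the higher the DW statistic)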
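The notes do not state the formula, so the sketch below assumes the standard definition of the Durbin–Watson statistic: the sum of squared changes in the residuals divided by the sum of squared residuals. The residual series are made up to show both directions:

```python
def durbin_watson(residuals):
    """Sum of squared residual changes over the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Residuals that keep their sign for several periods (positive correlation).
dw_positive = durbin_watson([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])  # 4 / 6

# Residuals that flip sign every period (negative correlation).
dw_negative = durbin_watson([1.0, -1.0, 1.0, -1.0])  # 12 / 4
```

The first series gives a value well below 2 and the second a value above 2, matching the ranges listed above.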
Multicollinearity
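- VIF (variance inflation factor): a value above 5 warrants further investigation, and a value above 10 indicates serious multicollinearity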
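The notes give only the thresholds, so the sketch below assumes the usual definition: the VIF of an independent variable is 1 / (1 - R²), where R² comes from regressing that variable on the other independent variables. The function name and the R² inputs are hypothetical:

```python
def vif_from_r2(aux_r2):
    """VIF given the R-squared of the auxiliary regression of one
    independent variable on all the others."""
    if not 0.0 <= aux_r2 < 1.0:
        raise ValueError("auxiliary R-squared must be in [0, 1)")
    return 1.0 / (1.0 - aux_r2)


vif_low = vif_from_r2(0.20)   # about 1.25: little shared variation
vif_high = vif_from_r2(0.95)  # about 20: well above the threshold of 10
```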
Key point 4: Extensions of Multiple Regression
- Describe influence analysis and methods of detecting influential data points
- Formulate and interpret a multiple regression model that includes qualitative independent variables
- Formulate and interpret a logistic regression model
Influential data
- A high-leverage point is a data point having an extreme value of an independent variable.
- An outlier is a data point having an extreme value of the dependent variable.
- Leverage ($h_{ii}$) for independent variables: CV = 3(k+1)/n, where k is the number of independent variables and n is the number of observations
- Studentized residuals for dependent variable: CV = 2.63 -3
- Cook’s distance (Cook’s D) for both; CV = square root of (k/n)
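A minimal sketch of the leverage check, assuming a regression with one independent variable (k = 1), for which leverage has the closed form 1/n + (x_i - mean)² / Σ(x_j - mean)². The cutoff used is 3(k+1)/n, and the data are hypothetical:

```python
def leverages_simple_regression(x):
    """Leverage of each observation in a one-variable regression."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]


# Hypothetical data: the last observation is far from the others.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 40.0]
k = 1  # number of independent variables

h = leverages_simple_regression(x)
cutoff = 3 * (k + 1) / len(x)  # 0.6 for n = 10
flagged = [i for i, hi in enumerate(h) if hi > cutoff]  # indices above the cutoff
```

The leverages always sum to k + 1, which is a quick check on the calculation.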
Qualitative independent (Dummy) variables
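- Intercept dummy: a dummy variable (D) entered in the regression as its own term, which shifts the intercept when D = 1
- Slope dummy: an interaction term between D and X (where X is an independent variable and D is a dummy variable), which changes the slope on X when D = 1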
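A model with both kinds of dummy can be written as follows. The symbols $\small \delta_0$ and $\small \delta_1$ are notation assumed here for the dummy coefficients:
\[\small
Y = \beta_0 + \delta_0 D + \beta_1 X + \delta_1 (D \times X) + \epsilon
\]
When $\small D = 0$ the fitted line is $\small \beta_0 + \beta_1 X$; when $\small D = 1$ it is $\small (\beta_0 + \delta_0) + (\beta_1 + \delta_1) X$, so $\small \delta_0$ shifts the intercept and $\small \delta_1$ shifts the slope.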
Calculate logistic regression
- Calculate P (probability of event happens) by using mean values
$$ \small P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p)}}$$
where $\small P(Y = 1 \mid X)$ is the probability that the dependent variable $\small Y$ is 1 given the vector of predictors $\small X$
\[\small
\ln\left(\frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p
\]
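- Likelihood ratio (LR) test (for logistic models): LR = -2 × (log-likelihood of the restricted model - log-likelihood of the unrestricted model). The log-likelihood is always negative, and a higher value (closer to 0) indicates a better fit.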
Key point 5: Time-Series Analysis
- Calculate and evaluate the predicted trend value for a time series, modeled as either a linear trend or a log-linear trend, given the estimated trend coefficients
- Describe factors that determine whether a linear or a log-linear trend should be used with a particular time series and evaluate limitations of trend models
- Explain the requirement for a time series to be covariance stationary and describe the significance of a series that is not stationary
- Describe the structure of an autoregressive (AR) model of order p and calculate one- and two-period-ahead forecasts given the estimated coefficients
- Explain how autocorrelations of the residuals can be used to test whether the autoregressive model fits the time series
- Explain mean reversion and calculate a mean-reverting level
- Contrast in-sample and out-of-sample forecasts and compare the forecasting accuracy of different time-series models based on the root mean squared error criterion
- Explain the instability of coefficients of time-series models
- Describe characteristics of random walk processes and contrast them to covariance stationary processes
- Describe implications of unit roots for time-series analysis, explain when unit roots are likely to occur and how to test for them, and demonstrate how a time series with a unit root can be transformed so it can be analyzed with an AR model
- Describe the steps of the unit root test for nonstationarity and explain the relation of the test to autoregressive time-series models
- Explain how to test and correct for seasonality in a time-series model and calculate and interpret a forecasted value using an AR model with a seasonal lag
- Explain autoregressive conditional heteroskedasticity (ARCH) and describe how ARCH models can be applied to predict the variance of a time series
- Explain how time-series variables should be analyzed for nonstationarity and/or cointegration before use in a linear regression
- Determine an appropriate time-series model to analyze a given investment problem and justify that choice
Calculate and evaluate the predicted trend value for a time series
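- Trend models are estimated like an OLS regression, with time as the independent variable: a linear trend is $y_t = b_0 + b_1 t + \epsilon_t$, and a log-linear trend is $\ln y_t = b_0 + b_1 t + \epsilon_t$. The predicted trend value comes from plugging the period $t$ into the estimated equation.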
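A minimal sketch of the two predicted trend values, assuming the standard forms of the linear trend (y = b0 + b1·t) and the log-linear trend (ln y = b0 + b1·t). The coefficients are hypothetical:

```python
import math


def linear_trend_forecast(b0, b1, t):
    """Predicted value of a linear trend model at period t."""
    return b0 + b1 * t


def log_linear_trend_forecast(b0, b1, t):
    """Predicted value of a log-linear trend model at period t.

    The model is fitted on ln(y), so the prediction is exponentiated
    to return to the original units.
    """
    return math.exp(b0 + b1 * t)


# Hypothetical estimated coefficients.
linear_value = linear_trend_forecast(10.0, 2.0, 5)           # 10 + 2 * 5
log_linear_value = log_linear_trend_forecast(1.0, 0.05, 20)  # exp(1 + 1)
```

A linear trend adds a constant amount each period, while a log-linear trend grows at a constant rate, which is why the latter suits series with exponential growth.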
Test for serial correlation
- D-W (Durbin–Watson) test
- Null hypothesis: No Autocorrelation
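- Critical values: a DW statistic between 0 and 2 indicates positive serial correlation (the stronger the correlation, the lower the DW statistic)
- around 2: no serial correlation
- between 2 and 4: negative serial correlation (the stronger the correlation, the higher the DW statistic)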
The following is a step-by-step guide to building a model to predict a time series.
- Understand the investment problem you have, and make an initial choice of model. One alternative is a regression model that predicts the future behavior of a variable based on hypothesized causal relationships with other variables (Causal effects). Another is a time-series model that attempts to predict the future behavior of a variable based on the past behavior (Time series) of the same variable.
- If you have decided to use a time-series model, compile the time series and plot it to see whether it looks covariance stationary. The plot might show important deviations from covariance stationarity, including the following:
- a linear trend,
- an exponential trend,
- seasonality, or
- a significant shift in the time series during the sample period (for example, a change in mean or variance).
- If you find no significant seasonality or shift in the time series, then perhaps either a linear trend or an exponential trend will be sufficient to model the time series. In that case, take the following steps:
- Determine whether a linear or exponential trend seems most reasonable (usually by plotting the series).
- Estimate the trend.
- Compute the residuals.
- Use the Durbin–Watson statistic to determine whether the residuals have significant serial correlation. If you find no significant serial correlation in the residuals, then the trend model is sufficient to capture the dynamics of the time series and you can use that model for forecasting.
- If you find significant serial correlation in the residuals from the trend model, use a more complex model, such as an autoregressive model. First, however, reexamine whether the time series is covariance stationary. The following is a list of violations of stationarity, along with potential methods to adjust the time series to make it covariance stationary:
- If the time series has a linear trend, first-difference the time series.
- If the time series has an exponential trend, take the natural log of the time series and then first-difference it.
- If the time series shifts significantly during the sample period, estimate different time-series models before and after the shift.
- If the time series has significant seasonality, include seasonal lags (discussed in Step 7).
- After you have successfully transformed a raw time series into a covariance-stationary time series, you can usually model the transformed series with a short autoregression. To decide which autoregressive model to use, take the following steps:
- Estimate an AR(1) model.
- Test to see whether the residuals from this model have significant serial correlation.
- If you find no significant serial correlation in the residuals, you can use the AR(1) model to forecast.
- If you find significant serial correlation in the residuals, use an AR(2) model and test for significant serial correlation of the residuals of the AR(2) model.
- If you find no significant serial correlation, use the AR(2) model.
- If you find significant serial correlation of the residuals, keep increasing the order of the AR model until the residual serial correlation is no longer significant.
- Your next move is to check for seasonality. You can use one of two approaches:
- Graph the data and check for regular seasonal patterns.
- Examine the data to see whether the seasonal autocorrelations of the residuals from an AR model are significant (for example, the fourth autocorrelation for quarterly data) and whether the autocorrelations before and after the seasonal autocorrelations are significant. To correct for seasonality, add seasonal lags to your AR model. For example, if you are using quarterly data, you might add the fourth lag of a time series as an additional variable in an AR(1) or an AR(2) model.
- Next, test whether the residuals have autoregressive conditional heteroskedasticity. To test for ARCH(1), for example, do the following:
- Regress the squared residual from your time-series model on a lagged value of the squared residual.
- Test whether the coefficient on the squared lagged residual differs significantly from 0.
- If the coefficient on the squared lagged residual does not differ significantly from 0, the residuals do not display ARCH and you can rely on the standard errors from your time-series estimates.
- If the coefficient on the squared lagged residual does differ significantly from 0, use generalized least squares or other methods to correct for ARCH.
- Finally, you may also want to perform tests of the model’s out-of-sample forecasting performance to see how the model’s out-of-sample performance compares to its in-sample performance.
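The autoregressive step of the guide above (estimate an AR(1) model, then test the residual autocorrelations) can be sketched as follows. This is a simplified illustration using only the standard library; the function names and the data are hypothetical. The t-statistic divides each residual autocorrelation by the standard error 1/√T:

```python
import math


def fit_ar1(series):
    """OLS fit of x_t = b0 + b1 * x_(t-1); returns (b0, b1, residuals)."""
    lagged, current = series[:-1], series[1:]
    n = len(current)
    mean_lag, mean_cur = sum(lagged) / n, sum(current) / n
    sxx = sum((a - mean_lag) ** 2 for a in lagged)
    sxy = sum((a - mean_lag) * (c - mean_cur) for a, c in zip(lagged, current))
    b1 = sxy / sxx
    b0 = mean_cur - b1 * mean_lag
    residuals = [c - (b0 + b1 * a) for a, c in zip(lagged, current)]
    return b0, b1, residuals


def autocorrelation_t_stats(residuals, max_lag):
    """t-statistic of each residual autocorrelation up to max_lag."""
    T = len(residuals)
    mean = sum(residuals) / T
    dev = [e - mean for e in residuals]
    variance_sum = sum(d * d for d in dev)
    t_stats = {}
    for lag in range(1, max_lag + 1):
        rho = sum(dev[t] * dev[t - lag] for t in range(lag, T)) / variance_sum
        t_stats[lag] = rho * math.sqrt(T)  # rho divided by 1 / sqrt(T)
    return t_stats


# Hypothetical noiseless series generated by x_t = 1 + 0.5 * x_(t-1).
b0, b1, _ = fit_ar1([10.0, 6.0, 4.0, 3.0, 2.5, 2.25, 2.125])

# Hypothetical residuals that alternate in sign.
t_stats = autocorrelation_t_stats([1.0, -1.0, 1.0, -1.0], max_lag=1)
```

If any t-statistic exceeds the critical value in absolute terms, the guide's next step is to raise the order of the AR model and repeat the test.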
Mean-reverting level
- Time series shows mean reversion if it tends to fall when its level is above its mean and rise when its level is below its mean
For an AR(1) model ($x_{t+1} = b_0 + b_1 x_t$), the equality $x_{t+1} = x_t$ implies the level $x_t = b_0 + b_1 x_t$ or that the mean-reverting level, $x_t$, is given by
\[ x_t = \frac{b_0}{1 - b_1} \]
So the AR(1) model predicts that the time series will stay the same if its current value is $\frac{b_0}{1 - b_1}$, increase if its current value is below $\frac{b_0}{1 - b_1}$, and decrease if its current value is above $\frac{b_0}{1 - b_1}$.
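A short sketch of the mean-reverting level, together with one- and two-period-ahead forecasts from the same AR(1) model. The coefficients and the current value are hypothetical:

```python
def mean_reverting_level(b0, b1):
    """Mean-reverting level b0 / (1 - b1) of an AR(1) model."""
    if b1 == 1:
        raise ValueError("no finite mean-reverting level when b1 = 1")
    return b0 / (1.0 - b1)


def ar1_forecasts(b0, b1, x_t):
    """One- and two-period-ahead forecasts from the current value x_t."""
    one_ahead = b0 + b1 * x_t
    two_ahead = b0 + b1 * one_ahead  # chain the forecast forward
    return one_ahead, two_ahead


# Hypothetical coefficients and current value.
level = mean_reverting_level(1.0, 0.5)                # 1 / (1 - 0.5) = 2.0
one_ahead, two_ahead = ar1_forecasts(1.0, 0.5, 4.0)   # 3.0, then 2.5
```

Because the current value (4.0) is above the mean-reverting level (2.0), both forecasts fall toward it.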
Random walk
- A random walk is a time series in which the value of the series in one period is the value of the series in the previous period plus an unpredictable random error.
- If the time series is a random walk, it is not covariance stationary.
- A random walk with drift is a random walk with a nonzero intercept term.
- All random walks have unit roots.
- If a time series has a unit root, then it will not be covariance stationary.
- No constant expected mean value or constant variance.
Unit root
- If a time series has a unit root, we can sometimes transform the time series into one that is covariance stationary by first-differencing the time series;
- we may then be able to estimate an autoregressive model for the first-differenced series.
- Two time series, neither with a unit root: we can safely use OLS (linear regression)
- Two time series, only one with a unit root: we should not use linear regression
- Two time series, both with a unit root: if the time series are cointegrated, we may safely use linear regression; however, if they are not cointegrated, we should not use linear regression. The (Engle–Granger) Dickey–Fuller test can be used to determine whether time series are cointegrated.
Covariance-stationary
- A time series is covariance stationary if the following three conditions are satisfied:
- First, the expected value of the time series must be constant and finite in all periods.
- Second, the variance of the time series must be constant and finite in all periods.
- Third, the covariance of the time series with itself for a fixed number of periods in the past or future must be constant and finite in all periods.
- Inspection of a nonstationary time-series plot may reveal an upward or downward trend (nonconstant mean) and/or nonconstant variance.
- The use of linear regression to estimate an autoregressive time-series model is not valid unless the time series is covariance stationary.
Forecast accuracy
- The root mean squared error (RMSE), defined as the square root of the average squared forecast error, is a criterion for comparing the forecast accuracy of different time-series models
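A minimal sketch of the RMSE comparison, using made-up actual values and forecasts from two hypothetical models:

```python
import math


def rmse(actual, forecast):
    """Square root of the average squared forecast error."""
    squared_errors = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical out-of-sample values and two competing sets of forecasts.
actual = [1.0, 2.0, 3.0, 4.0]
model_a = [1.0, 2.0, 3.0, 2.0]
model_b = [2.0, 3.0, 4.0, 6.0]

rmse_a = rmse(actual, model_a)  # sqrt(4 / 4) = 1.0
rmse_b = rmse(actual, model_b)  # sqrt(7 / 4)
better = "model_a" if rmse_a < rmse_b else "model_b"
```

The model with the lower RMSE on out-of-sample data is judged the more accurate forecaster.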
Key point 6: Machine Learning
- Describe supervised machine learning, unsupervised machine learning, and deep learning
- Describe overfitting and identify methods of addressing it
- Describe supervised machine learning algorithms—including penalized regression, support vector machine, k-nearest neighbor, classification and regression tree (CART), ensemble learning, and random forest—and determine the problems for which they are best suited
- Describe unsupervised machine learning algorithms—including principal components analysis, k-means clustering, and hierarchical clustering—and determine the problems for which they are best suited
- Describe neural networks, deep learning nets, and reinforcement learning
Supervised learning
- Supervised learning depends on having labeled training data (with inputs and outputs)
- the dependent variable (Y) is the target and the independent variables (X’s) are known as features.
Unsupervised learning
- With unsupervised learning, algorithms are trained with no labeled data. They have inputs (X’s) that are used for analysis without any target (Y) being supplied.
- Dimension reduction focuses on reducing the number of features while retaining variation across observations to preserve the information contained in that variation.
- Clustering focuses on sorting observations into groups (clusters) such that observations in the same cluster are more similar to each other than they are to observations in other clusters.
Deep learning
- In deep learning, sophisticated algorithms address complex tasks, such as image classification, face recognition, speech recognition, and natural language processing. Deep learning is based on neural networks (NNs), also called artificial neural networks (ANNs)
- In reinforcement learning, a computer learns from interacting with itself or data generated by the same algorithm.
Evaluating ML algorithm performance
- Generalization and Overfitting: Bias error, or the degree to which a model fits the training data. Algorithms with erroneous assumptions produce high bias with poor approximation, causing underfitting and high in-sample error.
- Variance error, or how much the model’s results change in response to new data from validation and test samples. Unstable models pick up noise and produce high variance, causing overfitting and high out-of-sample error.
- Base error due to randomness in the data.
- A learning curve plots the accuracy rate (= 1 – error rate) in the validation or test samples
- A fitting curve, which shows in- and out-of-sample error rates (Ein and Eout) on the y-axis plotted against model complexity on the x-axis
Supervised ML Algorithms
- LASSO (penalized regression): LASSO is one of the most popular types of penalized regression. It includes a penalty term that increases with the number of included features and shrinks the regression coefficients. When lambda equals 0, the regression is the same as OLS.
- Regularization describes methods that reduce statistical variability in high-dimensional data estimation problems—in this case, reducing regression coefficient estimates toward zero and thereby avoiding complex models and the risk of overfitting.
- Support vector machine (SVM)
- Some observations may fall on the wrong side of the boundary and be misclassified by the SVM algorithm. The SVM algorithm handles this problem by an adaptation called soft margin classification (Exhibit 7)
- K-nearest neighbor (KNN) is a supervised learning technique used most often for classification and sometimes for regression. The idea is to classify a new observation by finding similarities (“nearness”) between this new observation and the existing data.
- Classification and regression tree (CART) is another common supervised machine learning technique that can be applied to predict either a categorical target variable, producing a classification tree, or a continuous target variable, producing a regression tree.
- Unlike KNN, CART does not need an initial hyperparameter (like K) or a similarity (distance) measure (Q6)
- The root node and each decision node represent a single feature (f) and a cutoff value (c) for that feature.
- The CART algorithm chooses the feature and the cutoff value at each node that generates the widest separation of the labeled data to minimize classification error (e.g., by a criterion, such as mean-squared error).
- Regularization can occur via a pruning technique that can be used afterward to reduce the size of the tree. Sections of the tree that provide little classifying power are pruned (i.e., cut back or removed).
- The technique of combining the predictions from a collection of models is called ensemble learning, and the combination of multiple learning algorithms is known as the ensemble method.
- Ensemble learning can be divided into two main categories: (1) aggregation of heterogeneous learners (i.e., different types of algorithms combined with a voting classifier) or (2) aggregation of homogeneous learners (i.e., a combination of the same algorithm using different training data that are based, for example, on a bootstrap aggregating, or bagging, technique, as discussed later)
- A majority-vote classifier will assign to a new data point the predicted label with the most votes. For example, if the SVM and KNN models are both predicting the category “stock outperformance” and the CART model is predicting the category “stock underperformance,” then the majority-vote classifier will choose “stock outperformance.”
- Bootstrap aggregating (or bagging) is a technique whereby the original training dataset is used to generate n new training datasets or bags of data.
- Random forest: A random forest classifier is a collection of a large number of decision trees trained via a bagging method
- For example, a CART algorithm would be trained using each of the n independent datasets (from the bagging process) to generate the multitude of different decision trees that make up the random forest classifier
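The majority-vote and bagging ideas above can be sketched as follows. The function names are hypothetical, and the sketch assumes that each bag is drawn from the original training set by sampling with replacement, as in the usual bootstrap:

```python
import random
from collections import Counter


def majority_vote(predictions):
    """Label with the most votes (ties go to the label seen first)."""
    return Counter(predictions).most_common(1)[0][0]


def bootstrap_bags(dataset, n_bags, seed=0):
    """Create n_bags training sets by sampling with replacement."""
    rng = random.Random(seed)
    return [rng.choices(dataset, k=len(dataset)) for _ in range(n_bags)]


# Votes from three hypothetical models (SVM, KNN, CART).
votes = ["outperformance", "outperformance", "underperformance"]
chosen = majority_vote(votes)

# Three bags drawn from a toy training set of ten observations.
training_set = list(range(10))
bags = bootstrap_bags(training_set, n_bags=3)
```

In a random forest, one decision tree would be trained on each bag and the trees' predictions combined by the same majority vote.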
Unsupervised ML algorithms and principal component analysis (PCA)
Principal Components Analysis (PCA)
- PCA is useful in dimension reduction. A composite variable is a variable that combines two or more variables that are statistically strongly related to each other.
- In the context of PCA, eigenvectors define new, mutually uncorrelated composite variables that are linear combinations of the original features.
- An eigenvalue gives the proportion of total variance in the initial data that is explained by each eigenvector.
- The PCA algorithm orders the eigenvectors from highest to lowest according to their eigenvalues—that is, in terms of their usefulness in explaining the total variance in the initial data (this will be shown shortly using a scree plot).
- With respect to PC1, a perpendicular line dropped from each data point to PC1 shows the vertical distance between the data point and PC1, representing projection error.
- Moreover, the distance between each data point in the direction that is parallel to PC1 represents the spread or variation of the data along PC1.
- The PCA algorithm operates in such a way that it finds PC1 by selecting the line for which the sum of the projection errors for all data points is minimized and for which the sum of the spread between all the data is maximized.
- As a consequence of these selection criteria, PC1 is the unique vector that accounts for the largest proportion of the variance in the initial data.
- Scree plots, which show the proportion of total variance in the data explained by each principal component, can be helpful in this regard (see the accompanying sidebar)
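The numbers behind a scree plot can be sketched as follows, assuming raw eigenvalues of the covariance matrix are available; each proportion is the eigenvalue divided by the sum of all eigenvalues. The eigenvalues below are hypothetical:

```python
def explained_variance_proportions(eigenvalues):
    """Share of total variance explained by each principal component,
    ordered from the largest eigenvalue to the smallest."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [value / total for value in ordered]


# Hypothetical eigenvalues for three principal components.
proportions = explained_variance_proportions([0.5, 3.0, 1.5])

# Running total, useful for deciding how many components to keep.
cumulative = []
running = 0.0
for share in proportions:
    running += share
    cumulative.append(running)
```

Here the first two components together explain 90% of the total variance, so the third could plausibly be dropped.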
Cluster
- A cluster contains a subset of observations from the dataset such that all the observations within the same cluster are deemed “similar.”
- K-means is an algorithm that repeatedly partitions observations into a fixed number “k” of non-overlapping clusters.
- The number of clusters, k, is a model hyperparameter.
- Each cluster is characterized by its centroid (i.e., center), and each observation is assigned by the algorithm to the cluster with the centroid to which that observation is closest.
- The algorithm groups the observations in the following steps:
- 1. K-means starts by determining the position of the k (here, 3) initial random centroids.
- 2. The algorithm then analyzes the features for each observation. Based on the distance measure that is used, k-means assigns each observation to its closest centroid, which defines a cluster.
- 3. Using the observations within each cluster, k-means then calculates the new (k) centroids for each cluster, where the centroid is the average value of their assigned observations.
- 4. K-means then reassigns the observations to the new centroids, redefining the clusters in terms of included and excluded observations.
- 5. The process of recalculating the new (k) centroids for each cluster is reiterated.
- 6. K-means then reassigns the observations to the revised centroids, again redefining the clusters in terms of observations that are included and excluded
- Hierarchical clustering is an iterative procedure used to build a hierarchy of clusters.
- Agglomerative clustering (or bottom-up hierarchical clustering) begins with each observation being treated as its own cluster
- By contrast, divisive clustering (or top-down hierarchical clustering) starts with all the observations belonging to a single cluster
- A type of tree diagram for visualizing a hierarchical cluster analysis is known as a dendrogram, which highlights the hierarchical relationships among the clusters.
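The k-means steps listed above can be sketched in a few lines. This is a simplified illustration: the initial centroids are fixed rather than random so the result is reproducible, squared Euclidean distance is assumed as the distance measure, and the points are made up:

```python
def squared_distance(p, c):
    """Squared Euclidean distance between a point and a centroid."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid updates until labels stop changing."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        # Assign each observation to its closest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no observation changed cluster
        labels = new_labels
        # Recalculate each centroid as the average of its observations.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


# Hypothetical observations with two features, and k = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
labels, centroids = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
```

With random starting centroids, different runs can end in different clusterings, so the algorithm is often run several times.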
Neural networks, deep learning nets, and reinforcement learning
Neural networks
- Neural networks (also called artificial neural networks, or ANNs) are a highly flexible type of ML algorithm that have been successfully applied to a variety of tasks characterized by non-linearities and complex interactions among features
- Note that for neural networks, the feature inputs would be scaled (i.e., standardized) to account for differences in the units of the data.
- Each node has, conceptually, two functional parts: a summation operator and an activation function.
- First, each yellow neuron receives four inputs, and the summation operator weights the four inputs to produce a net input.
- Second, the activation function adjusts the strength of each neuron’s output (like a light dimmer switch). The process of transmission just described is referred to as forward propagation.
- Third, the predicted value is compared with the actual value, and the weights in the network are adjusted to decrease the error. When this adjustment works backward through the layers of the network, the process is called backward propagation. Learning takes place through this process of adjustment to the network weights with the aim of reducing total error. The gist of the updating can be expressed informally as: new weight = old weight - (learning rate) × (partial derivative of the total error with respect to the old weight).
Deep learning
- Neural networks with many hidden layers—at least 2 but potentially more than 20—are known as deep neural networks (DNNs)
Reinforcement learning
- The RL framework involves an agent that is designed to perform actions that will maximize its rewards over time, taking into consideration the constraints of its environment. (AlphaGo)
- The success of RL in dealing with the complexities of financial markets is still an open question.
Key point 7: Big Data Projects
- Identify and explain steps in a data analysis project
- Describe objectives, steps, and examples of preparing and wrangling data
- Evaluate the fit of a machine learning algorithm
- Describe objectives, methods, and examples of data exploration
- Describe methods for extracting, selecting and engineering features from textual data
- Describe objectives, steps, and techniques in model training
- Describe preparing, wrangling, and exploring text-based data for financial forecasting
4 Vs
- Volume refers to the quantity of data.
- Variety pertains to the array of available data sources.
- Velocity is the speed at which data are created.
- Veracity relates to the credibility and reliability of different data sources.
Steps in a data analysis project
- We begin with the top half of Exhibit 1, which shows the traditional (i.e., with structured data) ML Model Building Steps:
- 1. Conceptualization of the modeling task. This crucial first step entails determining what the output of the model should be (e.g., whether the price of a stock will go up/down one week from now), how this model will be used and by whom, and how it will be embedded in existing or new business processes.
- 2. Data collection. The data traditionally used for financial forecasting tasks are mostly numeric data derived from internal and external sources. Such data are typically already in a structured tabular format, with columns of features, rows of instances, and each cell representing a particular value.
- 3. Data preparation and wrangling. This step involves cleansing and preprocessing of the raw data. Cleansing may entail resolving missing values, out-of-range values, and the like. Preprocessing may involve extracting, aggregating, filtering, and selecting relevant data columns.
- 4. Data exploration. This step encompasses exploratory data analysis, feature selection, and feature engineering.
- 5. Model training. This step involves selecting the appropriate ML method (or methods), evaluating performance of the trained model, and tuning the model accordingly.
- The Text ML Model Building Steps used for the unstructured data sources of big data are shown in the bottom half of Exhibit 1. They differ from those used for traditional data sources and are typically intended to create output information that is structured. The major differences in the Text ML Model Building Steps are in the first four steps:
- 1. Text problem formulation. Analysts begin by determining how to formulate the text classification problem, identifying the exact inputs and outputs for the model. Perhaps we are interested in computing sentiment scores (structured output) from text (unstructured input). Analysts must also decide how the text ML model’s classification output will be utilized.
- 2. Data (text) curation. This step involves gathering relevant external text data via web services or web spidering (scraping or crawling) programs that extract raw content from a source, typically web pages. Annotation of the text data with high-quality, reliable target (dependent) variable labels might also be necessary for supervised learning and performance evaluation purposes. For instance, experts might need to label whether a given expert assessment of a stock is bearish or bullish.
- 3. Text preparation and wrangling. This step involves critical cleansing and preprocessing tasks necessary to convert streams of unstructured data into a format that is usable by traditional modeling methods designed for structured inputs.
- 4. Text exploration. This step encompasses text visualization through techniques, such as word clouds, and text feature selection and engineering.
Data preparation and wrangling
Structured Data
- Readme files are text files provided with the raw data that contain information related to a data file.
- External data usually can be accessed through an application programming interface (API)—a set of well-defined methods of communication between various software components—or the vendors can deliver the required data in the form of csv files or other formats (as previously mentioned).
- Data Preparation (Cleansing): This is the initial and most common task in data preparation that is performed on raw data. Data cleansing is the process of examining, identifying, and mitigating errors in raw data.
- Data Wrangling (Preprocessing): This task performs transformations and critical processing steps on the cleansed data to make the data ready for ML model training.
- Metadata (data that describes and gives information about other data)
- Data Preparation (Cleansing):
- The possible errors in a raw dataset include the following:
- 1. Incompleteness error is where the data are not present, resulting in missing data. This can be corrected by investigating alternate data sources. Missing values and NAs (not applicable or not available values) must be either omitted or replaced with “NA” for deletion or substitution with imputed values during the data exploration stage. The most common imputations are mean, median, or mode of the variable or simply assuming zero. In Exhibit 3, rows 4 (ID 3), 5 (ID 4), 6 (ID 5), and 7 (ID 6) are incomplete due to missing values in the Gender, Salary, Other Income, Name (Salutation), or State columns.
- 2. Invalidity error is where the data are outside of a meaningful range, resulting in invalid data. This can be corrected by verifying other administrative data records. In Exhibit 3, row 5 likely contains invalid data as the date of birth is out of the range of the expected human life span.
- 3. Inaccuracy error is where the data are not a measure of true value. This can be rectified with the help of business records and administrators. In Exhibit 3, row 5 is inaccurate (it shows “Don’t Know”); in reality, every person either has a credit card or does not.
- 4. Inconsistency error is where the data conflict with the corresponding data points or reality. This contradiction should be eliminated by clarifying with another source. In Exhibit 3, row 3 (ID 2) is likely to be inconsistent as the Name column contains a female title and the Gender column contains male.
- 5. Non-uniformity error is where the data are not present in an identical format. This can be resolved by converting the data points into a preferable standard format. In Exhibit 3, the data under the Date of Birth column is present in various formats. The data under the Salary column may also be non-uniform as the monetary units are ambiguous; the dollar symbol can represent US dollar, Canadian dollar, or others.
- 6. Duplication error is where duplicate observations are present. This can be corrected by removing the duplicate entries. In Exhibit 3, row 6 is a duplicate as the data under Name and Date of Birth columns are identical to the ones in row 3, referring to the same customer.
- Data Wrangling (Preprocessing)
- To make structured data ready for analyses, the data should be preprocessed. Data preprocessing primarily includes transformations and scaling of the data. These processes are exercised on the cleansed dataset. The following transformations are common in practice:
- 1. Extraction: A new variable can be extracted from the current variable for ease of analyzing and using for training the ML model. In Exhibit 4, the Date of Birth column consists of dates that are not directly suitable for analyses. Thus, an additional variable called “Age” can be extracted by calculating the number of years between the present day and date of birth.
- 2. Aggregation: Two or more variables can be aggregated into one variable to consolidate similar variables. In Exhibit 4, the two forms of income, Salary and Other Income, can be summed into a single variable called Total Income.
- 3. Filtration: The data rows that are not needed for the project must be identified and filtered. In Exhibit 4, row 7 (ID 8) has a non-US state; however, this dataset is for the US-based bank customers where it is required to have a US address.
- 4. Selection: The data columns that are intuitively not needed for the project can be removed. This should not be confused with feature selection, which is explained later. In Exhibit 4, Name and Date of Birth columns are not required for training the ML model. The ID column is sufficient to identify the observations, and the new extracted variable Age replaces the Date of Birth column.
- 5. Conversion: The variables can be of different types: nominal, ordinal, continuous, and categorical. The variables in the dataset must be converted into appropriate types to further process and analyze them correctly. This is critical for ML model training. Before converting, values must be stripped of prefixes and suffixes, such as currency symbols. In Exhibit 4, Name is nominal, Salary and Income are continuous, Gender and Credit Card are categorical with 2 classes, and State is nominal. If row 7 is not excluded, the Salary in row 7 must be converted into US dollars. The conversion task also applies to adjusting for the time value of money, time zones, and similar differences when present.
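The five transformations above can be sketched in a few lines of Python. This is only an illustration: Exhibit 4 is not reproduced in these notes, so the column names (`dob`, `salary`, `other_income`, `state`), the sample rows, and the short `US_STATES` list are all assumptions.

```python
from datetime import date

US_STATES = {"VA", "NY", "CA"}  # abbreviated, illustrative list of valid states


def to_number(text):
    """Conversion: strip currency symbols and separators, then cast to float."""
    return float(text.replace("$", "").replace(",", ""))


def wrangle(rows, today):
    cleaned = []
    for row in rows:
        # Filtration: keep only customers with a US address.
        if row["state"] not in US_STATES:
            continue
        # Extraction: derive Age (in whole years) from Date of Birth.
        dob = row["dob"]
        had_birthday = (today.month, today.day) >= (dob.month, dob.day)
        age = today.year - dob.year - (0 if had_birthday else 1)
        # Aggregation: combine the two income columns into Total Income.
        total_income = to_number(row["salary"]) + to_number(row["other_income"])
        # Selection: drop Name and Date of Birth, keep ID and the new columns.
        cleaned.append({"id": row["id"], "age": age,
                        "total_income": total_income, "state": row["state"]})
    return cleaned


# Hypothetical rows, not the data from Exhibit 4.
rows = [
    {"id": 1, "name": "Mr. A", "dob": date(1985, 12, 12),
     "salary": "$50,200", "other_income": "$5,000", "state": "VA"},
    {"id": 8, "name": "Ms. B", "dob": date(1990, 1, 1),
     "salary": "$60,000", "other_income": "$0", "state": "ON"},
]
result = wrangle(rows, today=date(2024, 6, 30))
```

Passing `today` explicitly keeps the Age calculation reproducible; in practice the current date would be used.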
- Outliers:
- When extreme values and outliers are simply removed from the dataset, it is known as trimming (also called truncation). For example, a 5% trimmed dataset is one for which the 5% highest and the 5% lowest values have been removed.
- When extreme values and outliers are replaced with the maximum (for large value outliers) and minimum (for small value outliers) values of data points that are not outliers, the process is known as winsorization.
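A minimal sketch of the two outlier treatments, assuming that a fixed fraction `pct` of observations in each tail is treated as outliers (how outliers are identified is a separate choice, and the sample data are hypothetical):

```python
def trim(values, pct):
    """Trimming: drop the lowest and highest `pct` fraction of observations."""
    ordered = sorted(values)
    k = int(len(ordered) * pct)  # number of points cut from each tail
    return ordered[k:len(ordered) - k]


def winsorize(values, pct):
    """Winsorization: cap each tail at the nearest value that is not an outlier."""
    ordered = sorted(values)
    k = int(len(ordered) * pct)
    low, high = ordered[k], ordered[len(ordered) - k - 1]
    return [min(max(v, low), high) for v in values]


data = [1, 12, 13, 14, 15, 16, 17, 18, 19, 500]  # hypothetical sample
trimmed = trim(data, 0.10)          # 1 and 500 are removed
winsorized = winsorize(data, 0.10)  # 1 becomes 12, 500 becomes 19
```

Trimming shrinks the dataset, while winsorization keeps the number of observations unchanged.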
- Scaling is a process of adjusting the range of a feature by shifting and changing the scale of data.
- Normalization is sensitive to outliers, so treatment of outliers is necessary before normalization is performed. Normalization can be used when the distribution of the data is not known.
- Standardization is relatively less sensitive to outliers as it depends on the mean and standard deviation of the data. However, the data must be normally distributed to use standardization.
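The two scaling methods are commonly defined as normalization, $\small (X_i - X_{min})/(X_{max} - X_{min})$, which rescales a feature to the range [0, 1], and standardization, $\small (X_i - \mu)/\sigma$, which centers a feature at 0 with unit standard deviation. A short sketch follows; the use of the population standard deviation and the sample salaries are assumptions.

```python
from statistics import mean, pstdev


def normalize(values):
    """Min-max normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]


salaries = [40_000, 50_000, 60_000, 70_000, 80_000]  # hypothetical feature
normalized = normalize(salaries)      # [0.0, 0.25, 0.5, 0.75, 1.0]
standardized = standardize(salaries)  # mean 0, standard deviation 1
```

Because `normalize` depends on the minimum and maximum, a single extreme value compresses all other points toward one end of the range, which is why outliers are treated first.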
Unstructured data are not organized into any systematic format that can be processed by computers directly.
Unstructured data – Text Preparation (Cleansing)
- A regular expression (regex) is a series that contains characters in a particular order. Regex is used to search for patterns of interest in a given text. For example, a regex “<.*?>” can be used to find all the html tags that are present in the form of <…> in text.
- The following steps describe the basic operations in the text cleansing process.
- 1. Remove html tags: Most of the text data are acquired from web pages, and the text inherits html markup tags with the actual content. The initial task is to remove (or strip) the html tags that are not part of the actual text using programming functions or using regular expressions. In Exhibit 7, the html tag can be identified by a regex and removed. Note that it is not uncommon to keep some generic html tags to maintain certain formatting meaning in the text.
- 2. Remove Punctuations: Most punctuations are not necessary for text analysis and should be removed. However, some punctuations, such as percentage signs, currency symbols, and question marks, may be useful for ML model training. These punctuations should be substituted with such annotations as /percentSign/, /dollarSign/, and /questionMark/ to preserve their grammatical meaning in the text. Such annotations preserve the semantic meaning of important characters in the text for further text processing and analysis stages. It is important to note that periods (dots) in the text need to be processed carefully. There are different circumstances for periods to be present in text—characteristically used for abbreviations, sentence boundaries, and decimal points. The periods and the context in which they are used need to be identified and must be appropriately replaced or removed. In general, periods after abbreviations can be removed, but the periods separating sentences should be replaced by the annotation /endSentence/. Some punctuations, such as hyphens and underscores, can be kept in the text to keep the consecutive words intact as a single term (e.g., e-mail). Regex are often used to remove or replace punctuations.
- 3. Remove Numbers: When numbers (or digits) are present in the text, they should be removed or substituted with an annotation /number/. This helps inform the computer that a number is present, but the actual value of the number itself is not helpful for categorizing/analyzing the text. Such operations are critical for ML model training. Otherwise, the computers will treat each number as a separate word, which may complicate the analyses or add noise. Regex are often used to remove or replace numbers. However, the number and any decimals must be retained where the outputs of interest are the actual values of the number. One such text application is information extraction (IE), where the goal is to extract relevant information from a given text. An IE task could be extracting monetary values from financial reports, where the actual number values are critical.
- 4. Remove white spaces: It is possible to have extra white spaces, tab spaces, and leading and trailing spaces in the text. The extra white spaces may be introduced after executing the previously mentioned operations. These should be identified and removed to keep the text intact and clean. Certain functions in programming languages can be used to remove unnecessary white spaces from the text. For example, the text mining package (tm) in R offers a stripWhitespace function.
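The four cleansing steps can be sketched with Python's built-in `re` module. This is a simplified illustration: it annotates only three punctuation marks, drops all periods instead of distinguishing abbreviations from sentence boundaries, and the sample text is made up.

```python
import re


def cleanse(text):
    # 1. Remove html tags using the regex from the text above.
    text = re.sub(r"<.*?>", " ", text)
    # 2a. Annotate punctuation that may be useful for ML model training.
    text = text.replace("%", " /percentSign/ ")
    text = text.replace("$", " /dollarSign/ ")
    text = text.replace("?", " /questionMark/ ")
    # 3. Replace numbers (including decimals) with an annotation.
    text = re.sub(r"\d+(\.\d+)?", " /number/ ", text)
    # 2b. Remove remaining punctuation; keep hyphens, underscores, and the
    #     slashes that delimit annotations.
    text = re.sub(r"[^\w\s/-]", " ", text)
    # 4. Collapse extra white space and strip leading/trailing spaces.
    return re.sub(r"\s+", " ", text).strip()


raw = "<p>Revenue rose 4.5% to $12 million.</p>  Will e-mail growth continue?"
clean = cleanse(raw)
# clean == "Revenue rose /number/ /percentSign/ to /dollarSign/ /number/ "
#          "million Will e-mail growth continue /questionMark/"
```

Numbers are replaced before the remaining punctuation is removed so that the decimal point in "4.5" is consumed together with its digits.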
Unstructured data – Text Wrangling (Preprocessing)
- The normalization process in text processing involves the following:
- 1. Lowercasing the alphabet removes distinctions among the same words due to upper and lower cases. This action helps the computers to process the same words appropriately (e.g., “The” and “the”).
- 2. Stop words are such commonly used words as “the,” “is,” and “a.” Stop words do not carry a semantic meaning for the purpose of text analyses and ML training. However, depending on the end-use of text processing, for advanced text applications it may be critical to keep the stop words in the text in order to understand the context of adjacent words. For ML training purposes, stop words typically are removed to reduce the number of tokens involved in the training set. A predefined list of stop words is available in programming languages to help with this task. In some cases, additional stop words can be added to the list based on the content. For example, the word “exhibit” may occur often in financial filings, which in general is not a stop word but in the context of the filings can be treated as a stop word.
- 3. Stemming is the process of converting inflected forms of a word into its base word (known as stem). Stemming is a rule-based approach, and the results need not necessarily be linguistically sensible. Stems may not be the same as the morphological root of the word. Porter’s algorithm is the most popular method for stemming. For example, the stem of the words “analyzed” and “analyzing” is “analyz.” Similarly, the British English variant “analysing” would become “analys.” Stemming is available in R and Python. The text mining package in R provides a stemDocument function that uses this algorithm.
- 4. Lemmatization is the process of converting inflected forms of a word into its morphological root (known as lemma). Lemmatization is an algorithmic approach and depends on the knowledge of the word and language structure. For example, the lemma of the words “analyzed” and “analyzing” is “analyze.” Lemmatization is computationally more expensive and advanced.
- After the cleansed text is normalized, a bag-of-words is created. Bag-of-words (BOW) representation is a basic procedure used to analyze text.
- The last step of text preprocessing is using the final BOW after normalizing to build a document term matrix (DTM). DTM is a matrix that is similar to a data table for structured data and is widely used for text data. Each row of the matrix belongs to a document (or text file), and each column represents a token (or term). The number of rows of DTM is equal to the number of documents (or text files) in a sample dataset.
- N-grams is a representation of word sequences. The length of a sequence can vary from 1 to n. When one word is used, it is a unigram; a two-word sequence is a bigram; and a 3-word sequence is a trigram; and so on.
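A small sketch of lowercasing, stop-word removal, bag-of-words, the document term matrix, and n-grams using only the standard library. The stop-word list and the three documents are illustrative assumptions; stemming and lemmatization are omitted because they need a dedicated library.

```python
from collections import Counter

STOP_WORDS = {"the", "is", "a", "in", "of"}  # tiny illustrative list


def tokenize(text):
    """Lowercase, split on white space, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Join each run of n adjacent tokens into a single token."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_term_matrix(docs):
    """Rows are documents, columns are tokens, cells are token counts."""
    bows = [Counter(tokenize(d)) for d in docs]  # one bag-of-words per document
    vocab = sorted(set().union(*bows))
    matrix = [[bow[term] for term in vocab] for bow in bows]
    return vocab, matrix


docs = ["The stock market is up", "The market is closed", "A stock rally"]
vocab, dtm = document_term_matrix(docs)
bigrams = ngrams(tokenize(docs[0]), 2)  # ['stock_market', 'market_up']
```

Here `dtm` has three rows, one per document, and one column per token in `vocab`.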
Data exploration
- Exploratory data analysis (EDA) is the preliminary step in data exploration. Exploratory graphs, charts, and other visualizations, such as heat maps and word clouds, are designed to summarize and observe data.
- Feature selection is a process whereby only pertinent features from the dataset are selected for ML model training. Selecting fewer features decreases ML model complexity and training time.
- Feature engineering is a process of creating new features by changing or transforming existing features. Model performance heavily depends on feature selection and engineering.
Structured Data
- Exploratory Data Analysis: For structured data, each data table row contains an observation and each column contains a feature. The basic one-dimensional exploratory visualizations are histograms, bar charts, box plots, and density plots.
- Feature selection and feature engineering (e.g., taking logarithm of a feature)
Unstructured Data
- Exploratory Data Analysis: It is useful to perform EDA of text data by computing on the tokens such basic text statistics as term frequency (TF), the ratio of the number of times a given token occurs in all the texts in the dataset to the total number of tokens in the dataset
- Feature Selection: The general feature selection methods in text data are as follows:
- 1. Frequency measures can be used for vocabulary pruning to remove noise features by filtering the tokens with very high and low TF values across all the texts. Document frequency (DF) is another frequency measure that helps to discard the noise features that carry no specific information about the text class and are present across all texts. The DF of a token is defined as the number of documents (texts) that contain the respective token divided by the total number of documents. It is the simplest feature selection method and often performs well when many thousands of tokens are present.
- 2. Chi-square test can be useful for feature selection in text data. The chi-square test is applied to test the independence of two events: occurrence of the token and occurrence of the class. The test ranks the tokens by their usefulness to each class in text classification problems. Tokens with the highest chi-square test statistic values occur more frequently in texts associated with a particular class and therefore can be selected for use as features for ML model training due to higher discriminatory potential.
- 3. Mutual information (MI) measures how much information is contributed by a token to a class of texts. The mutual information value will be equal to 0 if the token’s distribution in all text classes is the same. The MI value approaches 1 as the token in any one class tends to occur more often in only that particular class of text. Exhibit 18 shows a simple depiction of some tokens with high MI scores for their corresponding text classes. Note how the tokens (or features) with the highest MI values narrowly relate to their corresponding text class name.
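As a rough illustration of the first two methods, the sketch below computes document frequency and the chi-square statistic for a token and a class from a 2×2 table of document counts. The four mini-documents and their labels are hypothetical, and mutual information is left out.

```python
def document_frequency(token, docs):
    """Share of documents that contain the token."""
    return sum(token in d for d in docs) / len(docs)


def chi_square(token, docs, labels, target):
    """Chi-square statistic for independence of token occurrence and class."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_token, in_class = token in doc, label == target
        if has_token and in_class:
            a += 1  # token present, target class
        elif has_token:
            b += 1  # token present, other class
        elif in_class:
            c += 1  # token absent, target class
        else:
            d += 1  # token absent, other class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


# Hypothetical corpus: each document is a set of tokens.
docs = [{"stock", "market", "rally"}, {"stock", "market", "fell"},
        {"team", "won", "market"}, {"team", "lost", "game"}]
labels = ["finance", "finance", "sports", "sports"]

df_market = document_frequency("market", docs)             # 0.75
chi_stock = chi_square("stock", docs, labels, "finance")   # 4.0
chi_market = chi_square("market", docs, labels, "finance")
```

"stock" appears only in finance texts and scores higher than "market", which appears in both classes and is therefore less discriminative.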
- Feature engineering
- The following are some techniques for feature engineering, which may overlap with text processing techniques.
- 1. Numbers: In text processing, numbers are converted into a token, such as “/number/.” However, numbers can be of different lengths of digits representing different kinds of numbers, so it may be useful to convert different numbers into different tokens. For example, numbers with four digits may indicate years, and numbers with many digits could be an identification number. Four-digit numbers can be replaced with “/number4/,” 10-digit numbers with “/number10/,” and so forth.
- 2. N-grams: Multi-word patterns that are particularly discriminative can be identified and their connection kept intact. For example, “market” is a common word that can be indicative of many subjects or classes; the words “stock market” are used in a particular context and may be helpful to distinguish general texts from finance-related texts. Here, a bigram would be useful as it treats the two adjacent words as a single token (e.g., stock_market).
- 3. Name entity recognition (NER): The name entity recognition algorithm analyzes the individual tokens and their surrounding semantics while referring to its dictionary to tag an object class to the token. Exhibit 19 shows the NER tags of the text “CFA Institute was formed in 1947 and is headquartered in Virginia.” Additional object classes are, for example, MONEY, TIME, and PERCENT, which are not present in the example text. The NER tags, when applicable, can be used as features for ML model training for better model performance. NER tags can also help identify critical tokens on which such operations as lowercasing and stemming then can be avoided (e.g., Institute here refers to an organization rather than a verb). Such techniques make the features more discriminative.
- 4. Parts of speech (POS): Similar to NER, parts of speech uses language structure and dictionaries to tag every token in the text with a corresponding part of speech. Some common POS tags are noun, verb, adjective, and proper noun. Exhibit 19 shows the POS tags and descriptions of tags for the example text. POS tags can be used as features for ML model training and to identify the number of tokens that belong to each POS tag. If a given text contains many proper nouns, it means that it may be related to people and organizations and may be a business topic. POS tags can be useful for separating verbs and nouns for text analytics. For example, the word “market” can be a verb when used as “to market …” or noun when used as “in the market.” Differentiating such tokens can help further clarify the meaning of the text. The use of “market” as a verb could indicate that the text relates to the topic of marketing and might discuss marketing a product or service. The use of “market” as a noun could suggest that the text relates to a physical or stock market and might discuss stock trading. Also, for POS tagging, such compound nouns as “CFA Institute” can be treated as a single token. POS tagging can be performed using libraries or packages in programming languages.
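The first two techniques (length-specific number tokens and bigram tokens) can be sketched with regular expressions. The helper names and sample strings are made up for illustration; NER and POS tagging require an NLP library and are not shown.

```python
import re


def tag_numbers(text):
    """Replace each run of digits with a token that records its length."""
    return re.sub(r"\d+", lambda m: f"/number{len(m.group())}/", text)


def join_bigram(text, first, second):
    """Keep a discriminative two-word pattern intact as a single token."""
    pattern = rf"\b{re.escape(first)}\s+{re.escape(second)}\b"
    return re.sub(pattern, f"{first}_{second}", text)


tagged = tag_numbers("Founded in 1947 with account 1234567890")
# tagged == "Founded in /number4/ with account /number10/"

joined = join_bigram("the stock market and the market", "stock", "market")
# joined == "the stock_market and the market"
```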
Model training
Model fitting errors are caused by several factors—the main ones being dataset size and number of features in the dataset.
- Dataset Size: Small datasets can lead to underfitting of the model since small datasets often are not sufficient to expose patterns in the data. Restricted by a small dataset, an ML model may not recognize important patterns.
- Number of Features: A dataset with a small number of features can lead to underfitting, and a dataset with a large number of features can lead to overfitting.
Method Selection
- Supervised or unsupervised learning: The data for training and testing supervised ML models contain ground truth, the known outcome (i.e., target variable) of each observation in these datasets.
- Type of data. For numerical data (e.g., predicting stock prices using historical stock market values), classification and regression tree (CART) methods may be suitable. For text data (for example, predicting the topic of a financial news article by reading the headline of the article), such methods as generalized linear models (GLMs) and SVMs are commonly used. For image data (e.g., identifying objects in a satellite image, such as tanker ships moving in and out of port), NNs and deep learning methods tend to perform better than others. For speech data (e.g., predicting financial sentiment from quarterly earnings’ conference call recordings), deep learning methods can offer promising results.
- Size of data. A typical dataset has two basic characteristics: number of instances (i.e., observations) and number of features. The combination of these two characteristics can govern which method is most suitable for model training. For instance, SVMs have been found to work well on “wider” datasets with 10,000 to 100,000 features and with fewer instances. Conversely, NNs often work better on “longer” datasets, where the number of instances is much larger than the number of features.
- Class imbalance, where the number of instances for a particular class is significantly larger than for other classes, may be a problem for data used in supervised learning because the ML classification method’s objective is to train a high-accuracy model.
Performance evaluation
Error analysis. For classification problems, error analysis involves computing four basic evaluation metrics: true positive (TP), false positive (FP), true negative (TN), and false negative (FN) metrics. FP is also called a Type I error, and FN is also called a Type II error. Exhibit 23 shows a confusion matrix, a grid that is used to summarize values of these four metrics.
- Precision is the ratio of correctly predicted positive classes to all predicted positive classes: TP/(TP + FP). It is useful when the cost of a Type I error (FP) is high.
- Recall (also known as sensitivity) is the ratio of correctly predicted positive classes to all actual positive classes: TP/(TP + FN). It is useful when the cost of a Type II error (FN) is high.
- Accuracy is the percentage of correctly predicted classes out of total predictions.
- F1 score is the harmonic mean of precision and recall. F1 score is more appropriate (than accuracy) when unequal class distribution is in the dataset and it is necessary to measure the equilibrium of precision and recall.
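A short sketch of the four metrics computed from confusion matrix counts. The counts are hypothetical, and the function does not guard against a zero denominator.

```python
def classification_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)                  # correct among predicted positives
    recall = tp / (tp + fn)                     # correct among actual positives
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # correct among all predictions
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


metrics = classification_metrics(tp=30, fp=10, tn=50, fn=10)
# precision = 0.75, recall = 0.75, accuracy = 0.80, f1 = 0.75
```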
Receiver Operating Characteristic (ROC). This technique for assessing model performance involves the plot of a curve showing the trade-off between the false positive rate (x-axis) and true positive rate (y-axis) for various cutoff points—for example, for the predicted probability (p) in a logistic regression.
- False positive rate (FPR) = FP/(TN + FP)
- True positive rate (TPR) = TP/(TP + FN) = Recall
- Area under the curve (AUC) is the metric that measures the area under the ROC curve. An AUC close to 1.0 indicates near perfect prediction, while an AUC of 0.5 signifies random guessing.
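To make the construction concrete, the sketch below traces ROC points by sweeping the cutoff over the predicted probabilities and approximates AUC with the trapezoidal rule. The labels and scores are hypothetical, and an observation is classified as positive when its score is at or above the cutoff.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs for cutoffs from highest to lowest score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # cutoff above every score: nothing predicted positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [1, 1, 0, 1, 0, 0]                # hypothetical ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]    # hypothetical predicted probabilities
curve = roc_points(y_true, scores)
area = auc(curve)                          # 8/9, about 0.889
```

The result matches the pairwise interpretation of AUC: in 8 of the 9 positive-negative pairs, the positive observation has the higher score.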
Root Mean Squared Error (RMSE).
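RMSE is typically used for continuous predictions, as in regression, and summarizes all prediction errors in a single number:

\[\small
\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (\text{Predicted}_i - \text{Actual}_i)^2}{n}}
\]

A minimal sketch with made-up values:

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    n = len(actual)
    return sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)


error = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])  # sqrt((1 + 0 + 4) / 3)
```

A smaller RMSE indicates a better fit, and an RMSE of 0 means every prediction equals the actual value.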
Tune
Grid search is a method of systematically training an ML model by using various combinations of hyperparameter values, cross validating each model, and determining which combination of hyperparameter values ensures the best model performance.
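The idea can be sketched without any ML library. The example below uses a toy one-dimensional k-nearest-neighbor regressor as a stand-in model (an assumption for illustration, not a method from these notes), tries every combination of two hyperparameters, cross-validates each, and keeps the combination with the lowest RMSE.

```python
from itertools import product
from math import sqrt


def knn_predict(train, x, k, weighted):
    """Predict y at x from the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    if not weighted:
        return sum(y for _, y in nearest) / len(nearest)
    weights = [1 / (abs(px - x) + 1e-9) for px, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


def cv_rmse(data, k, weighted, folds=4):
    """Cross-validated RMSE: each fold is held out once for testing."""
    squared_errors = []
    for f in range(folds):
        test = data[f::folds]
        train = [p for i, p in enumerate(data) if i % folds != f]
        squared_errors += [(knn_predict(train, x, k, weighted) - y) ** 2
                           for x, y in test]
    return sqrt(sum(squared_errors) / len(squared_errors))


def grid_search(data, grid):
    """Evaluate every hyperparameter combination and return the best one."""
    best = None
    for combo in product(*grid.values()):
        params = dict(zip(grid, combo))
        score = cv_rmse(data, **params)
        if best is None or score < best[1]:
            best = (params, score)
    return best


# Hypothetical data: a linear trend with a small deterministic disturbance.
data = [(x, 2 * x + (1 if x % 3 == 0 else -1)) for x in range(20)]
grid = {"k": [1, 3, 5], "weighted": [False, True]}
best_params, best_score = grid_search(data, grid)
```

The cost grows with the product of the grid sizes (here 3 × 2 = 6 models, each trained once per fold), which is why grids are usually kept small.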
Financial forecasting projects
- Collection frequency (CF) is the number of times a given word appears in the whole corpus (i.e., collection of sentences) divided by the total number of words in the corpus.
- Term frequency (TF) is the number of times a given word appears in one document (here, a sentence) divided by the total number of words in that document.
- TF (Sentence Level) = WordCountInSentence/TotalWordsInSentence.
- Document frequency (DF) is the number of documents (texts) that contain the respective token divided by the total number of documents
- DF = SentenceCountWithWord/Total number of sentences.
- IDF (Inverse Document Frequency): A relative measure of how unique a term is across the entire corpus. IDF = log(1/DF).
- TF–IDF: To get a complete representation of the value of each word, TF at the sentence level is multiplied by the IDF of a word across the entire dataset. Higher TF–IDF values indicate words that appear more frequently within a smaller number of documents. TF–IDF = TF × IDF.
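The formulas above can be combined into a short sketch that treats each sentence as a document. The natural logarithm is assumed for IDF, and the three sentences are made up.

```python
from math import log


def tf_idf(sentences):
    """Return one {word: TF-IDF} dictionary per sentence."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    vocab = {w for sent in tokenized for w in sent}
    # DF = SentenceCountWithWord / Total number of sentences
    df = {w: sum(w in sent for sent in tokenized) / n for w in vocab}
    # IDF = log(1 / DF), natural log assumed
    idf = {w: log(1 / df[w]) for w in vocab}
    # TF (sentence level) = WordCountInSentence / TotalWordsInSentence
    return [{w: sent.count(w) / len(sent) * idf[w] for w in set(sent)}
            for sent in tokenized]


sentences = ["profit rose", "profit fell sharply", "costs rose"]
scores = tf_idf(sentences)
# In the second sentence, "fell" (found in 1 of 3 sentences) scores higher
# than "profit" (found in 2 of 3 sentences).
```

A word that appears in every sentence has DF = 1 and IDF = 0, so its TF-IDF is 0 regardless of how often it occurs.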