# Time Series 2 — Detrending by Linear Regression

In clinical research, eating disorders can take months or years for individuals to acknowledge, and recovery can take even longer, as some patients struggle to break the ‘vicious cycle’ between dieting and eating disorders. From a time series point of view, we will uncover how this ‘vicious cycle’ shows up in the relationship between weekly submissions to eating disorder forums and dieting forums, especially the latent lag effects between the two.

###### Assumptions
1. Each random variable has been standardized by subtracting its mean and dividing by its standard deviation. Thus, a value of 1 means 1 standard deviation above the average, and a 1-unit increase means an increase of 1 standard deviation.
2. The ‘control group’ gathers popular subreddit forums on topics like news, pets, and politics. We assume the monthly number of submissions to this group of forums indicates the general usage and submission frequency of the Reddit website.
3. The ‘general health group’ covers subreddit forums like health, sports, cook, recipe and medicine, which we assume to be neutral and topically unrelated to eating disorders and dieting.
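
The scaling in assumption 1 can be sketched in a few lines. This is a minimal Python illustration, not the code used for the analysis (which was done in R); the function name `standardize` and the sample counts are hypothetical. It uses the sample standard deviation (dividing by n - 1), which is also what R's `scale()` does.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample SD (n - 1 in the denominator)
    if sigma == 0:
        raise ValueError("cannot standardize a constant series")
    return [(x - mu) / sigma for x in values]

# Hypothetical weekly submission counts, for illustration only.
weekly_counts = [12, 15, 20, 22, 31]
z_scores = standardize(weekly_counts)
```

After this step the series has mean 0 and standard deviation 1, so a z-score of 1 reads as "one standard deviation above the average week".
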
###### Explore the Trend / Time Factor

Our goal is to uncover a temporal relationship between different types of post submissions with regard to only two factors: dieting and eating disorders. Thus, we are looking for a new version of these two time series that contains only the information specific to dieting or eating disorders.

Here is a graphical revisit of the original time series, which reveals a clear increase in the number of post submissions to each forum group every week.

A better way to capture the trend is to use smoothing methods. We plot the LOESS-smoothed trends in the graph below.
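
To make the idea of LOESS concrete, here is a simplified sketch: for every time point it fits a weighted straight line to the nearest neighbours, with tricube weights that fade out with distance. It is an illustration under stated assumptions, not the routine behind the plot: the function name `loess`, the `frac` parameter, and the data are hypothetical, and R's `loess()` has its own defaults for span and polynomial degree that this sketch does not try to match.

```python
def loess(x, y, frac=0.5):
    """Local linear regression with tricube weights (no robustness iterations)."""
    n = len(x)
    k = min(n, max(3, round(frac * n)))  # points used in each local fit
    smoothed = []
    for x0 in x:
        dist = [abs(xi - x0) for xi in x]
        # Bandwidth: distance to the k-th nearest point (guarded against zero).
        h = max(sorted(dist)[k - 1], 1e-12)
        w = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dist]
        sw = sum(w)
        xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
        ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0  # fall back to a local mean
        smoothed.append(ybar + slope * (x0 - xbar))
    return smoothed

# Hypothetical weekly post counts, for illustration only.
weeks = list(range(12))
posts = [3, 5, 4, 8, 7, 12, 11, 17, 15, 22, 21, 30]
trend = loess(weeks, posts, frac=0.5)
```

A larger `frac` uses more neighbours per fit and gives a smoother trend; a smaller one follows the data more closely.
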

###### Explore Other Confounding Factors

There are still other factors that might contribute to the rise and fluctuation in the number of post submissions to dieting and eating disorder forums. These include mechanical post selection errors during data collection, API restrictions set by Reddit, growing public awareness of health and wellbeing, and the overemphasized connection between general health and body weight (mostly commerce-driven misinformation), among others.

We assume that the overall systematic change is reflected by the control group post submissions (ctrl) and that the growing public awareness of physical health is reflected by the change in general health post submissions (gh). We are therefore curious about how eating disorder and dieting post submissions relate to these variables by themselves, in other words, independent of time.

As in most exploratory data analysis, we plot those variables in pairs. It is noticeable that the plots of ‘ed’ against ‘ctrl’, ‘ed’ against ‘gh’, ‘diet’ against ‘ctrl’, and ‘diet’ against ‘gh’ are not acceptably linear. Variable transformation seems to be necessary before linear regression.

###### Transformation and Correlation Plots

By plotting histograms, we can see that the distributions of the original variables are all heavily right-skewed. This is consistent with the nature of our time series: we observed many weeks with few post submissions to each forum group before 2016.

To make them less skewed and to magnify the small variations in weeks with fewer submissions, we apply a logarithm transformation to each variable after shifting it by 1 unit in the positive direction. (The logarithm is undefined for non-positive values.)
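
The shift-then-log step can be sketched as follows. This is a small Python illustration with hypothetical counts; `log_shift` and `skewness` are names introduced here, and the skewness measure is the simple moment ratio, used only to show the direction of the effect.

```python
import math

def log_shift(values):
    """Apply log(v + 1) to each value; log1p stays accurate for small v."""
    return [math.log1p(v) for v in values]

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (positive means right-skewed)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical counts: many quiet weeks and one very busy week.
counts = [0, 3, 12, 150, 2400]
logged = log_shift(counts)
```

A week with zero submissions maps to exactly 0 after the transformation, and the long right tail is pulled in, so `skewness(logged)` comes out smaller than `skewness(counts)`.
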

We can now draw the paired scatter plots to see the correlations between the transformed variables, which are more linear than the original ones. We also add a new independent variable, the squared value of time (timesq), as one of the predictors, in order to capture a more complicated time effect when the response variable is not linearly related to time itself.

###### Linear Regression

To uncover the correlation between ‘ed’ and ‘diet’ while controlling for other confounding factors, we could simply specify ‘diet’ and the other factors as independent variables / predictors and ‘ed’ as the dependent variable / response, and build a linear regression model to compute the coefficients.

However, we are not only looking for the coefficients. Our goal is to generate two new time series, ed and diet, that consist only of the information specific to eating disorder post submissions and dieting post submissions themselves. The way to do so is to regress each of the original two time series on the same variables: time, timesq, ctrl, gh, and the product (interaction) terms from possible combinations among them. The residuals from each regression then form the new series.
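
The core idea of detrending by regression, fit a model on the nuisance predictors and keep what is left over, can be sketched with a small least-squares routine. This is a Python illustration under stated assumptions: it solves the normal equations directly (R's `lm()` uses a QR decomposition instead), uses only time and timesq as predictors to stay short, and all names and numbers are hypothetical.

```python
def ols_fit(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def fitted_and_residuals(X, y, beta):
    """Return fitted values X b and residuals y - X b."""
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, residuals

# Hypothetical log-scale series with a curved trend, for illustration only.
time = list(range(8))
log_ed = [0.9, 1.6, 2.0, 2.9, 3.1, 4.2, 4.8, 6.1]
X = [[1.0, t, t * t] for t in time]  # intercept, time, timesq
beta = ols_fit(X, log_ed)
fitted, residuals = fitted_and_residuals(X, log_ed, beta)
```

Running the same design matrix against a second response (for example a hypothetical `log_diet`) gives a second residual series that has been cleaned of exactly the same predictors, which is what makes the two series comparable afterwards.
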

We can select the ‘best’ linear model that provides the lowest AIC score and evaluate the overall performance by AIC and adjusted R-squared.
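
For reference, both criteria can be computed from a model's residual sum of squares. The sketch below is a Python illustration of the standard formulas for a Gaussian linear model; the AIC counts the error variance as one extra parameter, which is also how R's `AIC()` treats an `lm` fit. The candidate models and their numbers are hypothetical, not results from this analysis.

```python
import math

def gaussian_aic(rss, n, n_coef):
    """AIC = -2 * logLik + 2 * (n_coef + 1); the +1 is the error variance."""
    loglik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    return -2 * loglik + 2 * (n_coef + 1)

def adjusted_r2(rss, tss, n, n_coef):
    """Adjusted R-squared; n_coef includes the intercept."""
    return 1 - (rss / (n - n_coef)) / (tss / (n - 1))

# Hypothetical candidates: (residual sum of squares, number of coefficients).
n_obs, tss = 200, 500.0
candidates = {
    "time": (180.0, 2),
    "time + timesq": (120.0, 3),
    "time + timesq + time^3": (119.5, 4),
}
scores = {name: gaussian_aic(rss, n_obs, k) for name, (rss, k) in candidates.items()}
best = min(scores, key=scores.get)  # lowest AIC wins
```

In this made-up example the cubic term lowers the residual sum of squares slightly, but not enough to pay for the extra parameter, so the quadratic model has the lowest AIC.
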

The two regression formulas we specify in the lm() function in R, which yield the lowest AIC score for both dependent variables, ‘ed’ and ‘diet’, are given below.

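$\log(\text{ed}+1) \sim \text{time} * \log(\text{ctrl}+1) * \log(\text{gh}+1) + \text{timesq} * \log(\text{ctrl}+1) * \log(\text{gh}+1)$

$\log(\text{diet}+1) \sim \text{time} * \log(\text{ctrl}+1) * \log(\text{gh}+1) + \text{timesq} * \log(\text{ctrl}+1) * \log(\text{gh}+1)$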
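
In R's formula syntax, `a * b * c` is shorthand for all main effects plus every interaction among `a`, `b`, and `c`. The sketch below, a Python illustration with hypothetical helper names (`crossing`, `design_row`) and the shortened labels `log_ctrl` and `log_gh`, spells out which terms the right-hand side above expands to and how one row of predictors would be built.

```python
from itertools import combinations

def crossing(*factors):
    """Main effects and all interactions, as in R's `a * b * c`."""
    terms = []
    for order in range(1, len(factors) + 1):
        for combo in combinations(factors, order):
            terms.append(":".join(combo))
    return terms

# The two crossings share log_ctrl, log_gh and log_ctrl:log_gh,
# so duplicates are dropped while keeping the order of first appearance.
rhs = crossing("time", "log_ctrl", "log_gh") + crossing("timesq", "log_ctrl", "log_gh")
terms = list(dict.fromkeys(rhs))

def design_row(values, terms):
    """One row of the design matrix: intercept, then one product per term."""
    row = [1.0]
    for term in terms:
        product = 1.0
        for name in term.split(":"):
            product *= values[name]
        row.append(product)
    return row

# Hypothetical predictor values for a single week.
week = {"time": 2.0, "timesq": 4.0, "log_ctrl": 3.0, "log_gh": 5.0}
row = design_row(week, terms)
```

The expansion gives 11 terms plus the intercept. Note that `time` and `timesq` never interact with each other, because they sit in separate crossings.
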

The summary results of these two models:

It is worth noting that this task involves a trade-off between model complexity and interpretability. If we increase model complexity by raising the order of predictors other than time, for example by adding second-order terms of ‘ctrl’ or ‘gh’, we are assuming a nonlinear relationship between the number of post submissions to the response groups (‘ed’, ‘diet’) and the predictor groups (‘ctrl’, ‘gh’), which is too arbitrary to interpret in reality. It makes more sense that in weeks when people submit more posts to subreddit forums, the overall expanded usage of the platform increases post submissions to particular groups of forums, such as eating disorder and dieting subreddits.

Meanwhile, including the second order of time is justifiable: the time series plot shows an explicitly nonlinear trend, and adding this one extra degree drastically improved model performance in terms of AIC, whereas adding further degrees did not significantly improve it.

###### Model Adequacy Check

We evaluate the adequacy of the linear models by plotting residuals against time and fitted values against time.

In general, the residuals are consistent with the error term assumptions of linear regression models.

###### New Time Series

So far, we have two new time series of fitted values for ‘log(ed+1)’ and ‘log(diet+1)’. We cannot yet take the difference between these fitted values and the original time series values, because the fitted values are on the transformed (log) scale. Also, we will not use the residuals directly from this version of the model, because they have little real-world meaning in this mathematical form. To generate the final valid time series, we first back-transform the fitted values by taking their exponential and shifting by -1, and then compute the residuals as the original values minus these back-transformed fitted values.
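
The back-transformation can be sketched as below. This is a Python illustration with hypothetical numbers; the function name `detrend` is introduced here, and the residual is taken as observed minus fitted, which is the usual convention but an assumption about the original R code.

```python
import math

def detrend(original, fitted_log):
    """Undo log(x + 1) on the fitted values, then return observed minus fitted."""
    fitted_original_scale = [math.expm1(f) for f in fitted_log]  # exp(f) - 1
    return [obs - fit for obs, fit in zip(original, fitted_original_scale)]

# Hypothetical observed counts and fitted values on the log(x + 1) scale.
observed = [10.0, 18.0, 33.0, 52.0]
fitted_log = [math.log(12.0), math.log(18.0), math.log(30.0), math.log(55.0)]
new_series = detrend(observed, fitted_log)
```

Because the subtraction happens on the original scale, each value of the new series reads directly as "this many more (or fewer) submissions than the model expected for that week".
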

Finally, the two new time series (residuals from linear regression models) look like this:

They appear to have been ‘detrended’, in line with the definition of weak stationarity in time series. The interpretation of the CCF (cross-correlation function) plot at the bottom will come in future posts.

Stay tuned and take care!