:ext QuasiQuotes
import qualified H.Prelude as H
H.initialize H.defaultConfig
[r| require("ggplot2")
require("pracma")|]
[r|
# Scatter plot with a fitted regression line; the caption reports the
# fit statistics (R^2, intercept, slope, and slope p-value).
ggplotRegression <<- function(the.title, dat, xvar, yvar){
  require(ggplot2)
  fit <- lm(yvar ~ xvar, dat)
  ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) +
    geom_point(size=2, shape=21) +
    ggtitle(the.title) +
    theme(plot.title = element_text(hjust = 0.5)) +
    theme(plot.caption = element_text(hjust = 0.5)) +
    theme(text=element_text(family="Arial", size=12)) +
    stat_smooth(method = "lm", col = "red", se=FALSE) +
    labs(caption = paste("RSq =", signif(summary(fit)$r.squared, 5),
                         " Intercept =", signif(fit$coef[[1]], 5),
                         " Slope =", signif(fit$coef[[2]], 5),
                         " P =", signif(summary(fit)$coef[2,4], 5)))
} |]
[r| dset_small <<- read.table("TM30-CES99.csv", header=TRUE, sep=",") |]
[r| dset_pca <<- prcomp(dset_small[1:15], center=TRUE, scale.=TRUE) |]
[rprint| summary(dset_pca) |]
[r| x=dset_pca$x[,1] #PC1
y=dset_pca$x[,2] #PC2
z=dset_pca$x[,3] #PC3
dset_small$XScaled = scale(dset_small$X)
actX=dset_small$XScaled
exportFrame.x <<- data.frame(x,y,z,actX)
write.csv(exportFrame.x, "frameX.txt", row.names=FALSE) |]
[r| PC1=dset_pca$x[,1] #PC1
PC2=dset_pca$x[,2] #PC2
z=dset_pca$x[,3] #PC3 (computed but not included in this frame)
actY = scale(dset_small$Y)
exportFrame.y <<- data.frame(PC1,PC2,actY)
write.csv(exportFrame.y, "frameY.txt", row.names=FALSE) |]
[r| PC1=dset_pca$x[,1] #PC1
PC2=dset_pca$x[,2] #PC2
z=dset_pca$x[,3] #PC3 (computed but not included in this frame)
actZ = scale(dset_small$Z)
exportFrame.z <<- data.frame(PC1,PC2,actZ)
write.csv(exportFrame.z, "frameZ.txt", row.names=FALSE) |]
For now the process is a bit kludgy: a CSV file is created by a Haskell program from a copy/paste of the Scheme HPEC program's output and is read in here. Column V1 is the predicted value and V2 is the actual value. The plot below is just a first sanity check of the process.
[rprint| predictedX <<- read.csv(file = 'paste01.csv', header=FALSE) |]
[rgraph|
ggplot(predictedX, aes(x=V2, y=V1)) +
  ggtitle("Actual X vs Predicted X") +
  xlab("Actual") + ylab("Predicted") +
  geom_point(size=2, shape=21) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text=element_text(family="Arial", size=12)) |]
(program->) (#procedure:s-plus ()) (#procedure:s-swap-1 ()) (#procedure:s-times ()) (#procedure:s-plus ()) (#procedure:s-minus ()) (#procedure:s-divide ()) (#procedure:s-plus ()) (#procedure:s-times ()) (#procedure:s-plus ()) (#procedure:s-divide ()) (#procedure:s-divide ()) (#procedure:s-push (3)) (#procedure:s-push (2)) (#procedure:s-minus ()) (#procedure:s-swap-1 ()) (#procedure:s-push (1)) (#procedure:s-push (0)) (#procedure:s-pop ()) (#procedure:s-divide ()) (#procedure:s-times ()) (#procedure:s-swap-1 ()) (#procedure:s-times ())
To interpret this, it helps to know that the evolved program runs on an HP-style (RPN) stack machine: `s-push` pushes a constant onto the stack, the arithmetic procedures (`s-plus`, `s-minus`, `s-times`, `s-divide`) pop two operands and push the result, `s-swap-1` swaps the top two stack entries, and `s-pop` discards the top entry. I'll have to check this to be absolutely sure.
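To make the reading concrete, here is a minimal sketch of such a stack machine in R. The semantics are inferred from the procedure names above and are an assumption on my part, not confirmed HPEC behavior.

[rprint|
# Sketch of an HP-style (RPN) stack machine. The operation semantics
# are ASSUMED from the HPEC procedure names, not taken from HPEC itself.
s <- c()                                   # the stack; top is element 1
s.push  <- function(v) s <<- c(v, s)
s.pop   <- function()  { v <- s[1]; s <<- s[-1]; v }
s.binop <- function(f) { b <- s.pop(); a <- s.pop(); s.push(f(a, b)) }
s.swap1 <- function()  { t <- s[1]; s[1] <<- s[2]; s[2] <<- t }

# Worked fragment from the listing: (s-push 3) (s-push 2) (s-minus)
s.push(3); s.push(2); s.binop(`-`)
s[1]                                       # 3 - 2 = 1 left on top
|]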
Another run of the HP stack computer, this time using PC1, PC2, and PC3, yielded the following formula for the prediction $P$,
$P = 0.25 \cdot (pc_2 - pc_1 - pc_3)$
This seems a bit better,
$R^2 = 0.987$
[rgraph|
pc1 <- exportFrame.x$x
pc2 <- exportFrame.x$y
pc3 <- exportFrame.x$z
actualB <<- exportFrame.x$actX
predictionB <<- 0.26 * (pc2 - pc1 - pc3)
newFrame <- data.frame(actualB, predictionB)
ggplot(newFrame, aes(x=actualB, y=predictionB)) +
  ggtitle("Actual X vs Predicted X") +
  xlab("Actual") + ylab("Predicted") +
  geom_smooth(method='lm', se=FALSE) +
  geom_point(size=2, shape=21) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(text=element_text(family="Arial", size=12)) |]
[rprint| cor(predictionB, actualB)^2 |]
Note only a small improvement in $R^2$ with the addition of PC3.
As the formulas seem quite simple, a PSO template using only PC1 and PC2 was created to predict the X values, giving a prediction $P$ as follows.
$P = k_1 + k_2 ((k_3\cdot pc_2) - (k_4\cdot pc_1))$
Where $k_1$ through $k_4$ are found using PSO, specifically,
$k_1 = 0.056$, $\ k_2 = 0.27$, $\ k_3 = 0.91$, $\ k_4 = 0.96$
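For reference, the sketch below shows the kind of search involved: a minimal hand-rolled particle swarm that fits $k_1$ through $k_4$ by minimizing squared error on the first half of the data (matching the half-for-training split described at the end). It is illustrative only, not the PSO code actually used; the swarm size, inertia, and acceleration constants are generic textbook values.

[rprint|
# Illustrative sketch only: a minimal PSO fitting k1..k4 in
#   P = k1 + k2 * ((k3 * pc2) - (k4 * pc1)),
# trained on the first half of the data. Not the PSO run that produced
# the constants above; swarm parameters here are generic defaults.
pc1 <- exportFrame.x$x;  pc2 <- exportFrame.x$y
act <- exportFrame.x$actX
idx <- 1:(length(act) %/% 2)                      # training half
sse <- function(k) {
  p <- k[1] + k[2] * ((k[3] * pc2[idx]) - (k[4] * pc1[idx]))
  sum((p - act[idx])^2)
}
set.seed(1)
n <- 30; d <- 4                                   # particles, dimensions
pos <- matrix(runif(n * d, -1, 1), n, d)          # particle positions
vel <- matrix(0, n, d)                            # particle velocities
pbest <- pos; pcost <- apply(pos, 1, sse)         # per-particle bests
g <- pbest[which.min(pcost), ]                    # global best
for (iter in 1:200) {
  r1 <- matrix(runif(n * d), n, d); r2 <- matrix(runif(n * d), n, d)
  vel <- 0.7 * vel + 1.5 * r1 * (pbest - pos) +
         1.5 * r2 * (matrix(g, n, d, byrow = TRUE) - pos)
  pos <- pos + vel
  cost <- apply(pos, 1, sse)
  imp <- cost < pcost
  pbest[imp, ] <- pos[imp, ]; pcost[imp] <- cost[imp]
  g <- pbest[which.min(pcost), ]
}
g                                                 # fitted k1..k4
|]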
The graphical results are shown below.
[rgraph|
pc1 <- exportFrame.x$x
pc2 <- exportFrame.x$y
actualA <- exportFrame.x$actX
predictionA <- 0.056 + 0.27 * ((0.91*pc2) - (0.96*pc1))
newFrame <- data.frame(actualA, predictionA)
ggplotRegression("Actual vs Predicted for X", newFrame, actualA, predictionA) |]
Unfortunately I didn't take good notes, but I believe the prediction formula for Y came from HPEC.
$P = 5 - 0.26\cdot pc_1 + 0.16\cdot pc_2 + 0.01\cdot pc_1\cdot pc_2$
Graphical results are shown below.
[rgraph|
pc1 <- exportFrame.y$PC1
pc2 <- exportFrame.y$PC2
actualA <<- exportFrame.y$actY
predictionA <<- 5 - 0.26*pc1 + 0.16*pc2 + 0.01*pc1*pc2
newFrame <<- data.frame(actualA, predictionA)
ggplotRegression("Actual vs Predicted for Y", newFrame, actualA, predictionA) |]
Several templates were tried. One that allowed non-integer powers of PC1 and PC2 produced good results in training but horrible results in testing, implying overfitting. Eventually the same template that was used for X was applied here. Specifically,
$P = k_1 + k_2 ((k_3\cdot pc_2) - (k_4\cdot pc_1))$
Where $k_1$ through $k_4$ are,
$k_1 = -0.17$, $\ k_2 = 0.27$, $\ k_3 = 0.95$, $\ k_4 = 0.75$
The results are shown graphically below.
[rgraph|
pc1 <- exportFrame.z$PC1
pc2 <- exportFrame.z$PC2
actualA <<- exportFrame.z$actZ
# Earlier templates, kept for reference but superseded by the final
# assignment below:
# predictionA <<- -0.17 - 0.18 * ((1.43*pc2) - (-1.13*pc1))
# predictionA <<- -0.17 + 1.15 * ((-0.24*(pc2**0.1859)) - (1.23*(pc1**0.4397)))
predictionA <<- -0.17 + 0.27 * ((-0.95*pc2) - (0.75*pc1))
newFrame <<- data.frame(actualA, predictionA)
ggplotRegression("Actual vs Predicted for Z", newFrame, actualA, predictionA) |]
Z's relatively poor performance is of interest. If it is noise, a histogram of the regression residuals might give a clue to its source. This is done below.
[rgraph|
fit <- lm(predictionA ~ actualA, newFrame)
g <- fit$residuals
h <- hist(g, main="Histogram of Z Residuals", breaks=11, family="Arial", col="lightgray")
# Overlay a normal curve using the residuals' mean and sd,
# scaled to the histogram's count axis.
xfit <- seq(min(g), max(g), length = 40)
yfit <- dnorm(xfit, mean = mean(g), sd = sd(g))
yfit <- yfit * diff(h$mids[1:2]) * length(g)
lines(xfit, yfit, col = "blue", lwd = 1)
|]
The distribution appears to be somewhat Gaussian.
So, what are sources of Gaussian noise (as opposed to uniform noise)? Johnson–Nyquist noise (or thermal noise) is Gaussian. The implication might be that the signal to noise ratio is worse for the Z values.
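As a quick numerical complement to the eyeball test, a Shapiro–Wilk normality test can be run on the same residuals (recomputed here rather than relying on the fit from the histogram cell):

[rprint|
# Shapiro-Wilk test on the Z residuals: a large p-value means
# normality cannot be rejected.
fitZ <- lm(predictionA ~ actualA, newFrame)
shapiro.test(fitZ$residuals)
|]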
This analysis was intended as a first cut at gauging the difficulty of the prediction problem. For HPEC, all the data was used for both training and predicting. For PSO, half the data was used to train and the full dataset was used for predicting.
So, to summarize: using the small (~100 data samples) dataset, reasonable $R^2$ values were obtained for predicting the X, Y, and Z values.
Value Predicted | $R^2$ |
---|---|
X | 0.988 |
Y | 0.971 |
Z | 0.914 |
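As a cross-check, the sketch below recomputes the three $R^2$ values directly from the stored export frames using the prediction formulas quoted above; small differences from the table are possible where the constants quoted in the text were rounded.

[rprint|
# Cross-check of the summary table: recompute R^2 for X, Y, and Z
# from the export frames and the prediction formulas given earlier.
px <- 0.056 + 0.27 * ((0.91*exportFrame.x$y) - (0.96*exportFrame.x$x))
py <- 5 - 0.26*exportFrame.y$PC1 + 0.16*exportFrame.y$PC2 +
      0.01*exportFrame.y$PC1*exportFrame.y$PC2
pz <- -0.17 + 0.27 * ((-0.95*exportFrame.z$PC2) - (0.75*exportFrame.z$PC1))
c(X = cor(px, exportFrame.x$actX)^2,
  Y = cor(py, exportFrame.y$actY)^2,
  Z = cor(pz, exportFrame.z$actZ)^2)
|]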