So I implemented batch gradient descent in Haskell, to simultaneously solidify my understanding of the algorithm and work on learning Haskell.
It got a bit bumpy. I've preserved my realtime notes of the mess. But the short version is that after a certain number of iterations that was an increasing function the learning rate, the model would just terminate in weights of Infinity for all features.
Step 1: get some more eyeballs. A kindly recurser pointed out that the argument orders to one of my functions was switched. This is the kind of thing that Haskell is supposed to avoid, but when everything is of type [Double] I guess one can hardly rely on the type checker to catch that it's the wrong one, can one?
Alas, that wasn't the bug. I was just multiplying two lists together elementwise, and since multiplication is commutative, well.
Step 2: implement in an easier language first. With math-y code that produces wildly incorrect results, there are really two possibilities: either the code is buggy or the code is fine but my understanding of the math is wrong. (Or both, I suppose.) The second possibility seemed easier to eliminate first—with correct math, one can get evidence supporting the proposition that the code is correct from getting correct results, but it doesn't really work the other way around unless there's some way of proving code correctness other than testing. So.
So I re-implemented the algorithm in python, complete with lots of slow and careful idiot-checking print statements. Somewhere in the middle, I realized that I hadn't scaled the features in the Haskell version (alas, that didn't turn out to be the bug either, but it did kindly turn my lists of
Infinity into lists of
NaN). The Python version produced reasonable results. So that's great, my math is good. Back to the Haskell.
printfstatements into this language—do you need an IO type or something??– I was at a loss for a while.
It turns out that ghci has a debugger, which I played around with for a while, but it wasn't terribly enlightening—it's a bit involved, and could use more study.
Then I discovered that there is a printf equivalent. Because of course there is. It's in the magic Debug.Trace library. And it's extra magical: you can just stick a call to
traceShow in front of whatever code you want to look at, and you'll get to see whatever values you pass it.
With that, it didn't take me long to find my bug. I successively logged 20 iterations of all the interesting parameters to my gradient descent function, and soon discovered that on each iteration of the gradient, the error was monotonically increasing, rather than decreasing. Which obviously isn't right.
The obvious hypothesis there is that a sign got flipped somewhere. And, lo and behold, after looking at the relevant part of the code, inside the gradient descent summation where I meant to be subtracting the label from the hypothesis, I actually was subtracting the hypothesis from the label.
Subtraction, alas, is not commutative.
Anyway, here's the fixed code!
module Olsgradient where import Data.List default (Double) addIntercept :: [[Double]] -> [[Double]] addIntercept = map (\x -> 1.0:x) predict :: [[Double]] -> [Double] -> [Double] predict observations weights = let mult = map (\x -> zipWith (*) x weights) observations in map sum mult subtractMaker :: Double -> [Double] -> [Double] -> Double subtractMaker learnRate costs featureList = let costFeatureMult = zipWith (*) costs featureList in learnRate * sum costFeatureMult gradientStep :: Double -> [Double] -> [Double] -> [[Double]] -> [Double] gradientStep learnRate labels weights observations = let preds = predict observations weights costs = zipWith (-) preds labels featureMatrix = transpose observations subtractors = map (subtractMaker learnRate costs) featureMatrix in zipWith (-) weights subtractors innerTrainOLS :: [[Double]] -> [Double] -> [Double] -> Double -> Double -> Double -> Double -> [Double] innerTrainOLS observations labels weights learnRate threshold maxIter numIter | numIter > maxIter = weights | sse < threshold = weights | otherwise = innerTrainOLS observations labels newWeights learnRate threshold maxIter (numIter + 1) where preds = predict observations weights sse = sum $ map (**2.0) (zipWith (-) labels preds) newWeights = gradientStep learnRate labels weights observations trainOLS :: [[Double]] -> [Double] -> Double -> Double -> Double -> [Double] trainOLS observations labels learnRate threshold maxIter = let obvs = addIntercept observations numFeats = length $ head obvs initweights = replicate numFeats 1 in innerTrainOLS obvs labels initweights learnRate threshold maxIter 0 mean :: [Double] -> Double mean lst = sum lst / fromIntegral (length lst) standardDeviation :: [Double] -> Double standardDeviation lst = let m = mean lst n = length lst squaredErrors = map (\x -> (x - m) ** 2.0) lst in sqrt (sum squaredErrors / fromIntegral n) scale :: [Double] -> [Double] scale lst = let m = mean lst stdev = standardDeviation lst in map (\x -> (x - m) / stdev) lst
Note how smelly it is. For example,
innerTrainOLS :: [[Double]] -> [Double] -> [Double] -> Double -> Double -> Double -> Double -> [Double] is like legit stink. But it does the job for now, and cleanup can come later. :-)