Announcing The Little Learner: A Straight Line to Deep Learning

Daniel P. Friedman and I (Anurag Mendhekar) are pleased to announce that our upcoming book The Little Learner: A Straight Line to Deep Learning just got its release date, complete with a Preorder Sale (Barnes and Noble, 25% off). The book comes out on 2/21/2023.

"The Little Learner" covers all the concepts necessary to develop an intuitive understanding of the workings of deep neural networks: tensors, extended operators, gradient descent algorithms, artificial neurons, dense networks, convolutional networks, residual networks and automatic differentiation.

The authors aim to explain the workings of Deep Learning to readers who may not have the mathematical sophistication necessary to read the existing literature on the subject. Unlike other books in the field, this book makes very few assumptions about background knowledge (high-school mathematics and familiarity with programming). The authors use a layered approach to construct advanced concepts from first principles using really small (“little”) programs that build on one another. This is one of the things that makes this book unique.

The other is that it introduces these ideas using a conversational style in the Question/Answer format that is characteristic of the other books in the Little series. The conversational style puts the reader at ease and lets ideas be introduced frame by frame, rather than hitting the reader with a wall of text.

It is (of course!) written using elementary Scheme and the code will be released as a Racket package.


Thank you @themetaschemer :racket_heart:

I’ve added it to Books · racket/racket Wiki · GitHub

Is there a publisher page I can link to?

Best regards


This is awesome! Are you able to share the table of contents by any chance?

I have to wait until Feb 23? The cruelty!

VERY excited to read this—I’ve wanted something like this for a long time. If you want an early reader with a mind unspoiled by any particular intelligence or preexisting deep understanding of the domain, I offer up mine. :wink:


Thanks Stephen! The MIT Press hasn't put up a page for it yet, but both Amazon and BN have.


Laurent, our ToC by itself is a bit cryptic, but should give you some idea. The sequencing goes like this: Minimal Scheme Intro for those who don't know it; Minimal machine learning by hand; Tensors; Operator extension; Gradient descent; Stochastic Gradient descent and variations; Neurons + Universal approximation; Structuring Neural networks; Classification using dense layers; Signals; Convolutional layers; and two appendices that are entirely dedicated to Automatic Differentiation.


Thanks, Pete! Will keep that in mind!

This will definitely be on my bookshelves come winter/spring next year. Wonderful!


Hey! It looks like the book is out!

I'm super excited to buy a copy.

So now, the hard part: whom to order it from? The vendors listed on the MIT Press website are ... heavily British? They list Blackwells, Foyles, Hive, and Waterstones. Any opinions? Um, aside from "not Amazon", which is pretty much my first and only criterion?


Oh! A little investigation suggests that is more or less exactly what small-business zealots like me are looking for. As a side note, it's absolutely tragic that there isn't a single independent new-book dealer in this town of 54K people.


Any local bookseller should be able to order.

If you are affiliated with an educational institution with a library, ask if they can purchase it for you - it doesn’t matter if you are staff or student. This may also work for community libraries.

This also applies to the other fine books listed at

Some directories of booksellers:

I'll put in a plug for the Seminary Co-op, "the country’s first not-for-profit bookstores whose mission is bookselling." It's on special order right now (which only adds a few days), but they're very likely to stock it if people order it. (They have actual human booksellers empowered to make decisions about books that seem interesting.)

I pre-ordered the book a couple months ago on I got a pleasant surprise a few days ago when they charged my credit card, which they do only when they ship the book. Very much looking forward to diving in!

Got my copy yesterday! Very excited.

Just a note, I get a bad cert domain error when trying to use the link for semcoop.

Whatever the problem was, it seems to have been fixed.

Some personal notes as I read it.

It's a good book so far, but there are occasionally parts of it that frustrate me to no end. It's the kind of book in which I'll probably start writing notes in the margin. Thankfully, there's a lot of room.

I'm just past Chapter 5. One of the key problems I've had with it is really due to my own blasted curiosity. I was curious to see what would happen if I bumped up the number of revs from the tiny 1000 to something ridiculously larger, like 1,000,000.

That is, I should be able to do something like:

(with-hypers ((alpha 0.001)
              (revs 1000000))
  (gradient-descent ((l2-loss plane) plane-xs plane-ys)
                    (list (tensor 0.0 0.0) 0.0)))

and I expect this to take some time, but it should just work, right? But alas, no!

And it doesn't work, for non-obvious reasons. Tensor values are represented with a special encoding that's used for automatic differentiation. The implication is that the more operations we perform while iterating on a value (such as the theta model value), the more space its representation takes, and eventually we run out of memory.
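As a rough illustration of that growth, here is a toy model in plain Racket. It is not malt's actual representation: the `dual` struct and the helpers `const`, `d*`, `d-`, `history-size`, `strip`, `step`, and `iterate` are made-up names for this sketch, and the gradient is a fixed number rather than something computed by automatic differentiation. The only point is that a value which remembers the operands it was built from keeps its whole history alive unless that history is dropped between steps.

```racket
#lang racket

;; Toy stand-in for a "dual": a real number plus the operands it was
;; computed from. This is NOT malt's representation, only a model of the idea.
(struct dual (real parents))

;; A dual with no history.
(define (const x) (dual x '()))

;; Arithmetic that records its operands.
(define (d* a b) (dual (* (dual-real a) (dual-real b)) (list a b)))
(define (d- a b) (dual (- (dual-real a) (dual-real b)) (list a b)))

;; Count every node reachable from d, i.e. the history it keeps alive.
(define (history-size d)
  (+ 1 (for/sum ([p (dual-parents d)]) (history-size p))))

;; Drop the history and keep only the number.
(define (strip d) (const (dual-real d)))

;; One gradient-descent-like step with a fixed, made-up gradient of 1.0.
(define (step p)
  (d- p (d* (const 0.001) (const 1.0))))

;; Apply f to x, n times.
(define (iterate f n x)
  (for/fold ([acc x]) ([i (in-range n)]) (f acc)))

(define grown (iterate step 100 (const 0.0)))
(define stripped (iterate (lambda (p) (strip (step p))) 100 (const 0.0)))

(history-size grown)     ; 401: four more nodes for every step taken
(history-size stripped)  ; 1: the history is dropped after every step
```

Both runs compute the same number; only the amount of retained history differs, and in the first run it grows linearly with the number of steps.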

It's the same problem I ran into when working with rational numbers: by maintaining perfect precision, we can end up with ridiculous bignum numerators and denominators if we're not careful.
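The rational-number analogy is easy to reproduce in plain Racket, which uses exact rationals by default. The recurrence below is an arbitrary one picked for this sketch (nothing from the book), and `step`, `iterate`, and `denominator-digits` are hypothetical helper names.

```racket
#lang racket

;; An arbitrary recurrence, x <- x^2 + 1/7, chosen only for illustration.
(define (step x) (+ (* x x) 1/7))

;; Apply f to x, n times.
(define (iterate f n x)
  (for/fold ([acc x]) ([i (in-range n)]) (f acc)))

;; Number of decimal digits in the denominator of an exact rational.
(define (denominator-digits q)
  (string-length (number->string (denominator q))))

;; Exact arithmetic: every squaring roughly doubles the size of the denominator.
(define exact-result (iterate step 12 1/3))

;; Floating point: the same recurrence in constant space, with rounding.
(define float-result (iterate step 12 (exact->inexact 1/3)))

(denominator-digits exact-result)  ; thousands of digits after only 12 steps
float-result                       ; a single flonum
```

Converting to floating point between steps plays the same role here as dropping the automatic-differentiation history: a little precision or bookkeeping is given up in exchange for bounded memory.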

After fighting the documentation, I figured out that a way to handle this is to rip out the special automatic-differentiation representation, the duality, like this:

;; rip out the duality
(define rip (ext1 ρ 0))

After which, we can redefine gradient-descent to rip out duality between each iteration:

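(define gradient-descent
  (lambda (obj theta)
    ;; f performs one revision: step each parameter against its gradient,
    ;; then rip out the duality so the representation does not keep growing.
    (let ([f (lambda (big-theta)
               (map rip
                    (map (lambda (p g)
                           (- p (* alpha g)))
                         big-theta
                         (gradient-of obj big-theta))))])
      (revise f revs theta))))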

And now it works.

So now I have a little more understanding of what's going on underneath the surface, and it makes me simultaneously happy and sad. Happy that I understand this better now, but sad that I am incomprehensible; there are very few people I interact with who care about this sort of stuff...

Time to read more.


I got stuck in this book earlier than you did; your post makes me want to get back into it. Many thanks for the bump!

Does the memory consumption issue still occur when using one of the alternate representations of tensors? I haven't tried anything other than the 'learner representation so far.

I did try switching representations to 'flat-tensors, but it did not seem to help. It's possible that I didn't do it right. Independent confirmation would be interesting and useful to hear.

I finished chapter 8 over the weekend. This was one of the more frustrating chapters for me, not because the material was difficult, but because the analogy of momentum to relay runners did not feel natural. It felt so contrived that it really bothered me for the rest of the chapter, honestly. (I suspect the miss for me here may be a cultural thing: I do not have any particular fondness for Groucho, Chico, Harpo, Gummo, or Zeppo.)

I think I can summarize the last few chapters:

  • Chapter 6: adds batching with random selection so we can do stochastic gradient descent.

  • Chapter 7: generalizes gradient-descent so that each parameter can be augmented, or "accompanied", by additional data that carries state across iterations. The change is to use three functions that (1) wrap each parameter on entry ("inflate"), (2) unwrap each parameter before exit ("deflate"), and (3) update each parameter, incorporating the extra state ("update"). Since "inflate", "deflate", and "update" share the same suffix, the chapter jokes about this a bit (perhaps a bit too much).

  • Chapter 8: uses the results from chapter 7 to attach an auxiliary "velocity" to each parameter. We add a momentum mechanism to the gradient update to help smooth the walk through the parameter space. External resources such as "Momentum" in the Cornell University Computational Optimization Open Textbook (Optimization Wiki) explain more clearly what happens when we introduce momentum: in practice, it helps to reduce oscillation.
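
For anyone who wants to poke at the inflate/deflate/update idea without the book's machinery, here is a minimal sketch in plain Racket. It is my own approximation, not the book's code: the function names, their signatures, the hyperparameter values, and the example objective are all assumptions, and the gradient is written by hand instead of coming from automatic differentiation.

```racket
#lang racket

;; Hyperparameters (made-up values for this sketch).
(define alpha 0.1)  ; learning rate
(define mu 0.9)     ; momentum coefficient

;; Generic loop: inflate on entry, update revs times, deflate on exit.
;; grad maps a list of plain parameters to a list of gradients.
(define (descend inflate deflate update grad theta revs)
  (let loop ([big-theta (map inflate theta)] [n revs])
    (if (zero? n)
        (map deflate big-theta)
        (loop (map update big-theta (grad (map deflate big-theta)))
              (sub1 n)))))

;; Plain gradient descent: parameters carry no extra state.
(define (plain-update p g) (- p (* alpha g)))

;; Momentum: each parameter is accompanied by a velocity.
(define (velocity-inflate p) (list p 0.0))
(define (velocity-deflate P) (first P))
(define (velocity-update P g)
  (let ([v (- (* mu (second P)) (* alpha g))])  ; blend old velocity with new step
    (list (+ (first P) v) v)))

;; Example objective: (x - 3)^2 + 10 (y + 1)^2, with a hand-written gradient.
(define (grad theta)
  (list (* 2 (- (first theta) 3.0))
        (* 20 (+ (second theta) 1.0))))

(define plain-result
  (descend identity identity plain-update grad (list 0.0 0.0) 500))

(define momentum-result
  (descend velocity-inflate velocity-deflate velocity-update
           grad (list 0.0 0.0) 500))
```

With these particular values, the plain update keeps bouncing back and forth along the steep y direction, while the velocity version settles close to the minimum at (3, -1). That is one small instance of momentum reducing oscillation; other hyperparameter choices will behave differently.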