Section 13.4 Primes as Sum of Squares
ΒΆRemark 13.4.1.
To keep with the theme of the unity of mathematics, we do this geometrically, not algebraically as in most texts, though the core ideas are similar with both proofs. We roughly follow [C.2.1, Chapter 10.6], but expanded greatly to avoid any direct reference to Hermann Minkowski's theorem on lattice points in a convex symmetric set. Interestingly, [C.4.16, Theorems 4.3 and 8.3] only state this and Lagrange's four square theorem, precisely because although Minkowski's Theorem provides a general framework for existence of such points geometrically, one still requires information about quadratic residues to provide lattice points to work on in the first place.
Subsection 13.4.1 A useful plot
ΒΆFirst, let's look at the following plot on the integer lattice. As you can see, I am plotting certain points on the circle x2+y2=n, with n=5 to begin. I have done some βmagicβ to turn the square root of β1 (mod n) into these points. Before telling you the magic, Figure 13.4.2 (and the interact following it) will help us get ready.
Remark 13.4.3.
Sometimes we call things like the set of blue dots a lattice, though in this text I will usually use the word lattice only to refer to the usual integer lattice of the black dots. A general lattice is something related to a concept from linear algebra β vectors generated by a basis, except instead of being vectors over Q or R, they are over Z.
xxxxxxxxxx
def _(p=(5,[q for q in prime_range(200) if q%4==1])):
f=mod(factorial((p-1)/2),p)
viewsize=ceil(math.sqrt(p))+2
g(x,y)=x^2+y^2
plot1 = implicit_plot(g-p, (-viewsize,viewsize), (-viewsize,viewsize), plot_points = 100)
grid_pts = [[i,j] for i in [-viewsize..viewsize] for j in [-viewsize..viewsize]]
plot_grid_pts = points(grid_pts,rgbcolor=(0,0,0),pointsize=2)
lattice_pts = [coords for coords in grid_pts if (coords[1]-f*coords[0])%p==0]
plot_lattice_pts = points(lattice_pts, rgbcolor = (0,0,1),pointsize=20)
show(plot1+plot_grid_pts+plot_lattice_pts, figsize = [5,5], xmin = -viewsize, xmax = viewsize, ymin = -viewsize, ymax = viewsize, aspect_ratio=1)
Definition 13.4.4.
We call the norm of a point (x,y) the sum of squares, N(x,y)=x2+y2.
Subsection 13.4.2 Primes which are sums of squares
ΒΆWe are now ready to state our big theorem for the section. (See Fact 14.1.8 for a quite different proof.)Theorem 13.4.5.
Every prime p of the form 4k+1 can be written as a sum of squares.
Proof.
The proof is fairly long. Here is the strategy; the first step will be detailed in Subsection 13.4.3 and Subsection 13.4.4.
Suppose we find some blue dot \((a,af+bp)\) such that
Then we know, modulo \(p\text{,}\) that
so \(p\) in fact divides the norm of the point \((a,af+bp)\text{.}\)
So we have that \(0<a^2+(af+bp)^2<2p\) and that \(p\mid a^2+(af+bp)^2\text{,}\) meaning the only possibility is \(p=a^2+(af+bp)^2\text{,}\) which gives \(p\) explicitly written as a sum of perfect squares.
Example 13.4.6.
For instance, with p=5, we have that f=(5β12)!=2!=2, so we need to find a point (a,2a+5b) such that
Guess and check with a=1 and b=0 gives us
so this point should work, and this does give the correct statement that
Subsection 13.4.3 Visualizing the proof
ΒΆTo prove the theorem that for any p=4k+1 we can write it as a sum of squares, we need to prove there is a blue dot (somewhere) that is not at the origin but also has norm smaller than 2p. We will prove this by heavy reference to graphics, but all claims also make sense algebraically. Sometimes we need help to be able to think about more involved proofs. We include a variation on the graphic in Figure 13.4.7 to make this visually clear. The bigger circle is the one we care about now β it has formula x2+y2=2p, so radius β2p. If we find a blue point inside the disk bounded by that circle, but not at the origin, then the argument in the proof sketch given for Theorem 13.4.5 shows this point must be on the smaller circle.
xxxxxxxxxx
def _(p=(5,[q for q in prime_range(200) if q%4==1])):
f=mod(factorial((p-1)/2),p)
viewsize=floor(sqrt(2*p))+2
g(x,y)=x^2+y^2
plot1 = implicit_plot(g-p, (-viewsize,viewsize), (-viewsize,viewsize), plot_points = 100)
plot2 = implicit_plot(g-2*p, (-viewsize,viewsize), (-viewsize,viewsize), plot_points = 100)
grid_pts = [[i,j] for i in [-viewsize..viewsize] for j in [-viewsize..viewsize]]
plot_grid_pts = points(grid_pts,rgbcolor=(0,0,0),pointsize=2)
lattice_pts = [coords for coords in grid_pts if (coords[1]-f*coords[0])%p==0]
plot_lattice_pts = points(lattice_pts, rgbcolor = (0,0,1),pointsize=10)
show(plot1+plot2+plot_grid_pts+plot_lattice_pts, figsize = [5,5], xmin = -viewsize, xmax = viewsize, ymin = -viewsize, ymax = viewsize, aspect_ratio=1)

The thinnest triangle made by blue dots would be of height one. A typical one would have vertices the origin and the points (p,0) (with a=0,b=1) and (f,1) (with a=1,b=0).
The thinnest triangle made by the green dots has height two. It has width 2p (from the origin to (2p,0), the previous point doubled); the apex is the point (2f,2), which is (f,1) doubled.
triangles_on
to see the green dot triangle and parallelogram outlined in red.
xxxxxxxxxx
def _(p=(5,[q for q in prime_range(200) if q%4==1]), triangles_on=False):
f=mod(factorial((p-1)/2),p)
viewsize=2*p
g(x,y)=x^2+y^2
plot1 = implicit_plot(g-p, (-viewsize,viewsize), (-viewsize,viewsize), plot_points = 100)
plot2 = implicit_plot(g-2*p, (-viewsize,viewsize), (-viewsize,viewsize), plot_points = 100)
plot3 = line([[0,0], [2*p-2*Integer(f),2], [2*p,0], [2*Integer(f),-2], [0,0]], rgbcolor=(1,0,0))
plot4 = line2d([[0,0], [2*p,0]], rgbcolor=(1,0,0), linestyle='--')
grid_pts = [[i,j] for i in [-viewsize..viewsize] for j in [-viewsize..viewsize]]
plot_grid_pts = points(grid_pts,rgbcolor=(0,0,0),pointsize=2)
lattice_pts = [coords for coords in grid_pts if (coords[1]-f*coords[0])%p==0]
plot_lattice_pts = points(lattice_pts, rgbcolor = (0,0,1),pointsize=10)
plot_lattice_pts2 = points([[2*coords[0],2*coords[1]] for coords in lattice_pts], rgbcolor = (0,1,0),pointsize=20)
if triangles_on:
show(plot1+plot2+plot3+plot4 + plot_grid_pts + plot_lattice_pts+plot_lattice_pts2, xmin = -viewsize/2, xmax = viewsize, ymin = -viewsize/2, ymax = viewsize/2, aspect_ratio=1)
else:
show(plot1+plot2+plot_grid_pts + plot_lattice_pts + plot_lattice_pts2, xmin = -viewsize/2, xmax = viewsize, ymin = -viewsize/2, ymax = viewsize/2, aspect_ratio=1)
Subsection 13.4.4 Finishing the proof
ΒΆLet's take stock.We've created circles of various sizes to find points in, and two lattices to examine.
The area of the circle is more than the area (4p) of the smallest parallelogram made by green dots.
Because all points inside the parallelogram (not just green, blue, or lattice points) will βrepeatβ outside of it in another parallelogram, 4p is the biggest area of a region that you can have and not βrepeatβ some point. (This parallelogram is often called a fundamental region in more general treatments.)
So, the interior of the circle, having a bigger area, must have two points (not necessarily blue points, just points on the plane) which are βrepeatedβ by translation of this parallelogram.
We start with the two points from Claim 13.4.11 in the disk bounded by the circle (points which are not necessarily on any lattice, blue, green, or even black).
Then we use elementary geometry to construct a blue point (namely, one of the form (a,af+bp)) which is strictly in the interior of the disk bounded by the circle of radius β2p. In particular, this point is not the origin.
Fact 13.4.9.
Let L be the parallelogram with vertices (0,0), (2f,2), (2p,0), and (2pβ2f,β2) and its interior (where f is a square root of β1 modulo p). Any plane region is the union of its intersection with all possible translations of L by rigidly moving L so that the origin is translated to another green point.
Proof.
We are not going to prove topological facts in this text, nor explore the further depths of lattices. So it suffices to note that every green point \((2a,2af+2bp)\) can serve as the leftmost vertex of a unique parallelogram not just congruent to, but translated from, \(L\text{,}\) and that by construction these cannot overlap (other than possibly along their edges).
Definition 13.4.10.
We say that two distinct points v,w in a plane region are βrepeatedβ if they are both rigid translations of the same point in L, where the allowed translations are those described in Fact 13.4.9.
Claim 13.4.11.
Consider the circle of radius β2p centered at the origin. The interior of the disk bounded by this circle has two points βrepeatedβ by shifting the parallelogram L.
Proof.
Recall from Fact 13.4.9 that the disk is composed of all its intersections with different parallelograms congruent to \(L\text{.}\)
Suppose that there are not two points βrepeatedβ within the disk (not including the boundary circle). Then every point thereof is a translation of a different point of \(L\text{.}\) One can make this a one-to-one function from the disk to \(L\) by sending each point in the disk to the corresponding one in \(L\text{.}\)
Because each such move is rigid, this function is area-preservingβ1βIf you looked at this footnote because you want a proof of this, recall we do not prove topological facts in this text! Next you'll be wanting a proof of the Jordan curve theorem from first principles. More seriously, we have to draw the line somewhere, and I find pedagogically that students would find proving assertions of this kind similar to proving \(1+1=2\) using Russell and Whitehead as a text. Convincing students that proving Fact 1.2.2 is useful is hard enough., which means the area of the disk must be less than or equal to that of \(L\text{.}\)
However, at the end of Subsection 13.4.3 we asserted the opposite! So by way of contradiction we have our two points.
Claim 13.4.12.
Given two points v,w (in the interior of the circle of radius β2p centered at the origin) which βrepeatβ from L, we can construct a point, not the origin, of the form (a,af+bp).
Proof.
Given how we defined βrepetitionβ, we know that the line segments from \(v\) and \(w\) to the leftmost vertex of their respective translations of \(L\) must themselves be rigid translations of each other, hence the line segment connecting \(v\) and \(w\) can be translated to a segment connecting the origin and another green point. Give this point the nameβ2βIn fact, as vectors of course this is the point, but we minimize formal use of vectors in this text. \(v-w\text{.}\)
Since \(v-w\) is of the form \((2a,2af+2bp)\) by definition, then the point halfway between it and the origin (or β\((v-w)/2\)β) is a blue point of the form \((a,af+bp)\text{,}\) and clearly not the origin since \(v-w\) itself is not the origin. It remains to show that this blue point is in the interior of the circle.
To see this, consider the distance \(d\) between \(v\) and \(w\text{.}\) By definition of a circle, it cannot possibly be further than twice the radius, so \(d\) is strictly less than \(2\sqrt{2p}\text{.}\) But then \(v-w\) cannot be more than \(d\) units from the origin, so the point \((a,af+bp)\text{,}\) being exactly half that distance from the origin, is less than distance \(\sqrt{2p}\) to the origin. By definition \((a,af+bp)\) is in the interior of the larger circle, as desired.
Example 13.4.13.
In Figure 13.4.14 we see the picture of how Claims 13.4.11 and 13.4.12 find the blue point in the circle. The black points are v and w, the arrows point between v and w and from the origin to vβw, and the midpoint of the second arrow is indeed blue.

Sage note 13.4.15. Examining code is good for you.
The next Sage cell makes Figure 13.4.14 interactive. But don't just use it to view the proof for other primes; examine the code itself.
This is by far the longest code we've seen up to this point. It is a brute force check of all movements of all points in the parallelogram to find two points in the bigger circle. Can you think of ways to make it more efficient?
xxxxxxxxxx
def _(p=(5,[q for q in prime_range(200) if q%4==1])):
f=Integer(mod(factorial((p-1)/2),p))
big = math.floor(math.sqrt(2*p))
viewsize=2*p
g(x,y)=x^2+y^2
plot1 = implicit_plot(g-p, (-viewsize,viewsize), (-viewsize,viewsize), plot_points = 100)
plot2 = implicit_plot(g-2*p, (-viewsize,viewsize), (-viewsize,viewsize), plot_points = 100)
plot3 = line([[0,0], [2*p-2*f,2], [2*p,0], [2*f,-2], [0,0]], rgbcolor=(1,0,0))
plot4 = line2d([[0,0],[2*p,0]], rgbcolor=(1,0,0), linestyle='--')
grid_pts = [[i,j] for i in [-viewsize..viewsize] for j in [-viewsize..viewsize]]
plot_grid_pts = points(grid_pts,rgbcolor=(0,0,0),pointsize=2)
lattice_pts = [coords for coords in grid_pts if (coords[1]-f*coords[0])%p==0]
plot_lattice_pts = points(lattice_pts, rgbcolor = (0,0,1),pointsize=10)
big_lattice_pts = [[2*coords[0],2*coords[1]] for coords in lattice_pts]
plot_lattice_pts2 = points(big_lattice_pts, rgbcolor = (0,1,0),pointsize=20)
w = []
v = []
mw = []
for i in [1..2*p-1]:
for coords in [l for l in big_lattice_pts if l!=[0,0]]:
if (i+coords[0])^2+(coords[1]-1)^2 < 2*p:
for coords2 in [k for k in big_lattice_pts if k!=[0,0] and k!=coords]:
if (i+coords2[0])^2 + (coords2[1]-1)^2 < 2*p:
w = [i+coords[0],coords[1]-1]
v = [i+coords2[0],coords2[1]-1]
vmw = [v[0]-w[0],v[1]-w[1]]
break
if w: break
if w: break
if not v:
for i in [j for j in [f..p+f]]:
for coords in [l for l in big_lattice_pts if l!=[0,0]]:
if (i+coords[0])^2+(coords[1]-1)^2 < 2*p:
for coords2 in [k for k in big_lattice_pts if k!=[0,0] and k!=coords]:
if (i+coords2[0])^2 + (coords2[1]-1)^2 < 2*p:
w = [i+coords[0],coords[1]-1]
v = [i+coords2[0],coords2[1]-1]
vmw = [v[0]-w[0],v[1]-w[1]]
break
if w: break
if w: break
if not v:
for i in [j for j in [p-f..2*p-f]]:
for coords in [l for l in big_lattice_pts if l!=[0,0]]:
if (i+coords[0])^2+(coords[1]+1)^2 < 2*p:
for coords2 in [k for k in big_lattice_pts if k!=[0,0] and k!=coords]:
if (i+coords2[0])^2+(coords2[1]+1)^2 < 2*p:
w = [i+coords[0],coords[1]+1]
v = [i+coords2[0],coords2[1]+1]
vmw = [v[0]-w[0],v[1]-w[1]]
break
if w: break
if w: break
P1=point(v,pointsize=20,rgbcolor=(0,0,0))
P2=point(w,pointsize=20,rgbcolor=(0,0,0))
Z=point(vmw,pointsize=20,rgbcolor=(0,0,0))
plot4 = arrow(w,v,rgbcolor=(0,0,0), thickness=1, linestyle='--', arrowsize=3)
plot5 = arrow((0,0),vmw,rgbcolor=(0,0,0), thickness=1, linestyle='--', arrowsize=3)
plot6 = point((vmw[0]/2,vmw[1]/2),pointsize=30)
show(plot1+plot2+plot3+plot4 + P1+P2+Z+plot4+plot5+plot6 + plot_grid_pts + plot_lattice_pts + plot_lattice_pts2, figsize = [5,5], xmin = -viewsize/2, xmax = viewsize, ymin = -viewsize/2, ymax = viewsize/2, aspect_ratio=1)
First, we are trying to prove something about squares by proving something about square roots. It works, but it means there will be many steps.
Secondly, we are not just algebraically proving it exists by solving an equation; we are forced to prove our square root exists with inequalities, which brings another set of complications.
Third, we chose to examine those inequalities geometrically to gain insight, so our proofs must use that insight β worthwhile, but stretching.
Remark 13.4.16.
Many more theorems of this kind, such as Lagrange's four square theorem, can be proved using similar techniques, which we are intentionally avoiding stating in their full generality. The names of Minkowski and Blichfeldt are associated with theorems using various symmetries and the notion of convexity in order to apply things more generally. Those who have had some physics may have heard of Minkowski before, as his work nearly beat Einstein to the notion of special relativity; his geometric framework for space-time gave Einstein the necessary apparatus to generalize to curved spacetime and general relativity.