[Agda] notes on type check performance

Fri May 12 20:43:29 CEST 2017

Dear list,

These are several notes concerning the type check performance.

I. Sharing representation
=========================

Probably the current problem is implementing a real sharing
representation of a term.
For example, to make the type checker work with

    let T = (A → A) → (A → A) → (A → A)    
    in 
    T × T

instead of its expanded form.

Agda has the  --sharing  option, but it helps in rare cases, at least,
no in my DoCon-A project.

II. One of the principal exponent cost source
=============================================

The following is about the source of the type check cost which has a 
fundamental origin and is common to Agda and Coq.

- On 2017-01-13  Robbert Krebbers wrote:  --------

This may be related: having large chains of coercions is also
problematic in Coq, see [1]: There is also a severe efficiency issue:
the complexity of Coq's term comparison algorithm is exponential in the
length of the coercion chain. While this is clearly a problem specific
to the current Coq implementation, it is hard and unlikely to be
resolved soon, so it seems prudent to seek a design that does not run
into it. 
The paper [1] describes another way of organizing algebraic hierarchies 
in Coq so that these long chains of coercions can be avoided. It quite 
crucially relies on Coq's canonical structures mechanism, so I have no 
clue whether you can practically do something similar in Agda. 

[1] François Garillot, Georges Gonthier, Assia Mahboubi, Laurence
Rideau. Packaging Mathematical Structures. 
TPHOLs'09. https://hal.inria.fr/inria-00368403v2 
----------------------------------------------------------------------

The paper [1] writes in introduction:

"
While this unbundling allows for maximal flexibility, it also induces a
proliferation of arguments that is rapidly overwhelming. A typical 
algebraic structure, such as a ring, involves half a dozen constants and
even more axioms. Moreover such structures are often nested, e.g., for 
the Cayley-Hamilton theorem one needs to consider the ring of
polynomials over the ring of matrices over a general commutative ring.
The size of the terms involved grows as C^n , where C is the number of
separate components of a structure, and  n  is the structure nesting
depth. For Cayley-Hamilton we would have  C = 15  and  n = 3,  and thus
terms large enough to make theorem proving impractical, given that
algorithms in user-level tactics are more often than not nonlinear.
"

I am trying to understand this example, and thus to understand the
nature of this particular source of the exponent cost grow effect.

It is about the construction of

      Pol (SquareMatrix (R : CommutativeRing)),

and let us imagine that this is coded in Agda.
Here  CommutativeRing  is a record defining a notion of an abstract 
commutative ring - something similar to CommutativeRing of Agda Standard
library.

a) CommutativeRing  has the operations  
   0#, 1#, _+_, _-_, _*_, zeroLaw, unityLaw, +oppositeLaw, +-assoc, 
   +-comm, *-assoc, *-comm, distributivityLawFor+*,

may be something else. All right, let us assume that there are about   
15  operations ("constants and axioms", as the paper writes).

b) SquareMatrix  over any  (R : CommutativeRing)
   is a data structure to be defined in application.
This is something like  
                  let C = Carrier R in  List (Vec C n),

with the condition of length of the vector list equal to  n.

If  R : CommutativeRing,  then indeed, for   SquareMatrix R
there can be implemented the operations  
      #0, #1, _+_, _-_, _*_, zeroLaw, unityLaw, +oppositeLaw, 
      +-assoc, +-comm; *-assoc, distributivityLawFor+*.

Here _*_ is multiplication of two  n × n  matrices, and the
implementation actually proves that these matrix operations satisfy the
axioms of Ring. 

This makes it the instance  squareMatrixRing : Ring.

c) For any (R : Ring) an univariate polynomial domain  Pol R  
   can be defined as a data structure.
There can be defined the operations  0#, 1#, _+_, _*_  for  Pol R. 
If R is commutative by _*_, then  Pol R  can be proved a certain 
CommutativeRing instance.
If R is not commutative, then, I expect, probably,  Pol R  can be proved
a certain  Ring  instance.

d) So, we have that the construct of 
                        Pol (SquareMatrix (R : CommutativeRing))

is expressed in the Agda program as 

   ∀ R : CommutativeRing,    -- abstract
   ∀ n : ℕ,  n > 0, 

   mR  = (squareMatrixRing R n)  : Ring,
   pmR = (polynomialRing mR)     : Ring.

About 15 operation instances are implemented in  mR,
Implementing _+_ for mR  needs to use only  0# and _+_  of R.
Implementing _*_ for mR  needs to use almost all operations of R,
this makes it a certain term depending of 15 argument operations.

Further, about 15 operation instances are implemented in  pmR,
Implementing _*_ for pmR  uses almost all operations of mR.

Assume that each operation instance implementation  op  in  mR  is 
represented merely as a _tree_  where each node expresses an operation. 
This node has several outbound edges, each edge points to some operation
op' in R which is used in the implementation of  op. 

The paper authors write: 
"The size of the terms involved grows as C^n".

I understand this so that the size of the trees involved in the 
described above Agda program for  
                             pmR = (polynomialRing mR)  : Ring
is bounded by  15 ^ 3
- provided that the above trees are stored in memory in expanded
unfolded form.
In various unlucky examples this upper bound of  15^3  is reached.

Assume roughly that each operation in the  mR  instance is a term in 
which occur 15 operations from R
(like it is for the _*_ instance).
So the tree size for  mR  (the type expression size) is  15^2.

Assume roughly that each operation in the  pmR  instance is a term in 
which occur 15 operations from mR (like it is for the _*_ instance).

Then, we have that that the tree for  pmR  has  15^3  leaves.

And note that in this term the terms for  mR._+_ and mR._*_  are copied
many times. And the implementation term for  rm._*_  is large.

I guess that this is what the authors wrote about.

II.2. What DoCon-A has with this respect
========================================

* DoCon-0.04.1  has the generic tower from  Semigroup and Monoid up to 
CommutativeRing, IntegralRing, FactorizationMonoid, FactorizationRing, 
UniqueFactorizationRing, Field.

* FactorizationRing  relies on the structure of Multiset and on the 
instance of  CommutativeMonoid for  (Multiset D)  for D : DecSetoid and
for the sum operation on multisets.
Various theorems rely on the instance of  CommutativeMonoid Nonzero 
for multipication on nonzero elements in any  R : IntegralRing.

* There are implemented all the above instances for Integer, except
Field.
And the (FactorizationRing Integer) instance is implemented via 
(FactorizationNonoid (Nat \ 0)).

This concerns _only an ordinary textbook on basic algebra_, a book of
about, say, 200 pages.

For this particular textbook, the depth of term for the instances
involved is bounded. 
The depth in the above example from [1] is  3.

(1) A similar depth bound in  DoCon-A-0.04.1  is somewhat about  5.

(2) Now, DoCon-A-0.04.1 needs  8 Gb memory and 64 minutes to type-check 
on a 3 GHz machine.

(3) On the other hand, the corresponding definitions and rigorous 
constructive proofs take only about 200 pages in a textbook, and a human
checks these proofs by reading this book. A human will check it,
probably, in 20 days. But the type checker processes the text millions
times faster. 

We see that the checker of (3) takes a small resource on this term of
size C^5, while that checker of (2) takes a large resource.

May be, this inefficiency can be improved by introducing the let-in 
representation of terms.

The paper [1] describes a certain subtle approach taken in Coq which 
handles the problem. So far, I do not understand it
(I need to look further into the paper). 
And I wonder, why did not they simply apply the let-in representation of
terms.

II.3. Comparing to Glasgow Haskell
==================================

DoCon (written in Haskell) has been tested on computing in the domains 
like
      Matrix (Pol [x, y] ((Pol t (Fraction Integer)) / (f))),

which have a greater term size.
And there are examples in practice that have sense and that have even
more levels in the domain construct. And we need to assume that the
depth in this tower of constructs is not bounded - similarly as the data
item in Agda may have arbitrary many constructor occurrences.

Glasgow Haskell compiles fast the above constructs of a great depth 
(with nested classes and instances).
Though, without dependent type, the problem is, probably, easier.

Regards,

------
Sergei