[Agda] Does Agda2 have fast arrays?

Tue Oct 5 00:10:58 CEST 2010

ajeffrey:
> On 10/04/2010 12:15 PM, Daniel Peebles wrote:
>> Things like stream fusion, that vector (and lots of high-performance
>> Haskell) relies on, rely on rewrite rules firing, which depend on the
>> syntactic structure of terms. So throwing unsafeCoerces in, in the wrong
>> places, will prevent the rules from firing, and would thus prevent
>> deforestation.
>
> Fair enough, Haskell's rules mechanism is just based on syntactic  
> pattern-matching, so anything that introduces additional syntax is going  
> to block the rewrites.  For some applications (e.g. the kinds of  
> high-performance computing way back in the original post) this will be a  
> deal-killer.
>
> I've been trying to replicate this experimentally, just to see how bad  
> the problem is, and what can be done about it, but I've hit a newbie  
> roadblock.
>
> A simple example is the one in the Stream Fusion paper:
>
> module Main ( main ) where
>
> import Prelude ( Int, (<=), (*), putStr, show )
> import Data.Stream ( Stream, sum, concatMap, map, unfoldr,
>   enumFromToInt )
>
> f :: Int -> Int
> f n = sum (concatMap (\ k -> map (\m -> k * m) (enumFromToInt 1 k))
>   (enumFromToInt 1 n))
>
> main = putStr (show (f 10000))
>

Well, you usually don't write in Streams directly, they're used at a concrete type.
Vector is a good example of a fusible library.

Also, concatMap doesn't fuse.  (It is the hardest thing to fuse (Stream (Stream ..)))

    -- NOTE: We can't fuse concatMap anyway so don't pretend we do.
    -- concatMap f = unstream . Stream.concatMap (stream . f) . stream

So, let's start with something simpler.

    import qualified Data.Vector as V

    f :: Int -> Int
    f n = V.sum $ V.map (\m -> n * m) (V.enumFromN 1 n)

    main = putStr (show (f 10000))

Fuses to:

    main_$s$wfoldlM'_loop :: Int# -> Int# -> Int# -> Int#

    main_$s$wfoldlM'_loop =
      \ (sc_s1io :: Int#)
        (sc1_s1ip :: Int#)
        (sc2_s1iq :: Int#) ->
        case ># sc_s1io 0 of _ {
          False -> sc2_s1iq;
          True ->
            main_$s$wfoldlM'_loop
              (-# sc_s1io 1)
              (+# sc1_s1ip 1)
              (+# sc2_s1iq (*# 10000 sc1_s1ip))
        }

The following optimizations happen:

    203 PreInlineUnconditionally
    94 PostInlineUnconditionally
    95 UnfoldingDone
    5 RuleFired
        3 SC:$wfoldlM'_loop0
        2 stream/unstream [Vector]
    13 LetFloatFromLet
    1 EtaReduction
    313 BetaReduction
    3 CaseOfCase
    65 KnownBranch
    1 CaseIdentity
    1 FillInCaseDefault
    21 SimplifierDone

And, with the -fasm backend, it becomes:

        Main_mainzuzdszdwfoldlMzqzuloop_info:
    .Lc1j9:
            testq %r14,%r14
            jg .Lc1jc
            movq %rdi,%rbx
            jmp *(%rbp)
    .Lc1jc:
            movq %rsi,%rax
            imulq $10000,%rax
            movq %rdi,%rcx
            addq %rax,%rcx
            leaq 1(%rsi),%rax
            decq %r14
            movq %rax,%rsi
            movq %rcx,%rdi
            jmp Main_mainzuzdszdwfoldlMzqzuloop_info

With the -fvia-C backend, it becomes:

   Main_mainzuzdszdwfoldlMzqzuloop_info:
            testq   %r14, %r14
            jle     .L5
    .L2:
            imulq   $10000, %rsi, %rcx
            leaq    -1(%r14), %r14
            leaq    1(%rsi), %rsi
            leaq    (%rcx,%rdi), %rdi
            jmp     Main_mainzuzdszdwfoldlMzqzuloop_info

And something similar under -fllvm.

Using ghc-core tool to inspect asm and core.

-- Don