[Agda-dev] More efficient handling of TopLevelModuleNames

Andreas Abel abela at chalmers.se
Thu Dec 4 13:31:06 CET 2014


These strings are not serialized as such, but turned to a UID during 
serialization, via a HashMap.  The HashMap is serialized as well. Upon 
deserialization, the UIDs are turned into strings again.

If we used, say 128bit hashes instead of strings, one could either 
serialize these directly, or turn into UIDs again via a HashMap.  Even 
the second alternative would probably be faster, because equality 
checking for 128bit Bytestrings should be faster than for strings.  Also 
a HashMap for hashes should be faster, since as getting n-bit hashes of 
128bit hashes is very quick: just take the first n bits.

Cheers,
Andreas

On 04.12.2014 11:31, Guillaume Brunerie wrote:
> I'm missing something.
> Why would it be faster to compute 3000000 hashes as opposed to just
> keeping the strings as is?
>
> Le 2 déc. 2014 23:15, "Andreas Abel" <abela at chalmers.se
> <mailto:abela at chalmers.se>> a écrit :
>
>     During AIM XX we investigated the high serialization times, noting
>     that for the standard library, 3000000 TopLevelModuleNames are
>     serialized (that would be e.g. a file name like
>     Data.Nat.Properties.Simple).
>
>     Nisse and I talked briefly about this problem last Friday, and one
>     idea is to use cryptographic hashes of module names instead of their
>     string representation.
>
>     Of course, there is the risk of collisions (which would be fatal).
>
>     https://en.wikipedia.org/wiki/__Birthday_attack
>     <https://en.wikipedia.org/wiki/Birthday_attack>
>
>     shows a table with collision probabilities.  Interestingly, with a
>     32-bit or 64-bit hash one can virtually store no module names if one
>     wants to stay collision free.  For a collision probability < 10^-18
>     we could only store 6 (!) different module names.
>
>     But even more interestingly, with 128 bit one can distinguish
>     already 10^10 (!) module names with sufficiently high probability.
>     This is more we ever need, most probably (at least for a while, Agda
>     developments will stay way below 10^10 files ;-)).
>
>     A 128bit hash seems to be an alternative to a String representation
>     of TopLevelModuleNames which could lead to a significant speed up of
>     serialization.
>
>
>     --
>     Andreas Abel  <><      Du bist der geliebte Mensch.
>
>     Department of Computer Science and Engineering
>     Chalmers and Gothenburg University, Sweden
>
>     andreas.abel at gu.se <mailto:andreas.abel at gu.se>
>     http://www2.tcs.ifi.lmu.de/~__abel/ <http://www2.tcs.ifi.lmu.de/~abel/>
>     _________________________________________________
>     Agda-dev mailing list
>     Agda-dev at lists.chalmers.se <mailto:Agda-dev at lists.chalmers.se>
>     https://lists.chalmers.se/__mailman/listinfo/agda-dev
>     <https://lists.chalmers.se/mailman/listinfo/agda-dev>
>

-- 
Andreas Abel  <><      Du bist der geliebte Mensch.

Department of Computer Science and Engineering
Chalmers and Gothenburg University, Sweden

andreas.abel at gu.se
http://www2.tcs.ifi.lmu.de/~abel/


More information about the Agda-dev mailing list