[Ur] Compiling Ur/Web with Link-time optimization (WAS: Re: Patches to compile Ur/Web with clang)

austin seipp as at hacks.yi.org
Fri Jun 17 23:40:54 EDT 2011


Just tested the latest revision of the hg repository. Works nicely
(compiling a project with 'debug'):

-> % make UR=$HOME/code/urweb-hg/build/bin/urweb P=ref
/Users/a/code/urweb-hg/build/bin/urweb ref
clang  -Wimplicit -Werror -O3 -fno-inline -I
/Users/a/code/urweb-hg/build/include/urweb  -c /tmp/webapp.c -o
/tmp/webapp.o -g
clang -Werror -O3 -lm -pthread
-L/Users/a/code/urweb-hg/build/lib/urweb/.. -lurweb -lurweb_http
-lssl -lcrypto -lz   /tmp/webapp.o -o /Users/a/code/ur/ref.exe
-lsqlite3 -g -L/usr/local/lib
clang: warning: argument unused during compilation: '-pthread'
-> %

I guess for clang on OS X, '-pthread' is redundant, but I don't think
it's much to worry about.

So, for the whole 'link time' thing, I decided to put my money where
my mouth/brain is. I took the easy way. I booted up a Debian VM and
installed GCC 4.6 (with LTO support) and GNU gold out of the
sid/unstable tree. GCC 4.6 has relatively stable link-time
optimization support, capable of building working versions of large
projects like the Linux kernel or Firefox.

So, I hacked the build system to use LTO on my machine. This was
pretty easy, but required a few changes at build time and
configuration time. I tested with a statically linked copy of the
Ur/Web runtime system and the builtin HTTP server. I used the 'ref'
demo from the main site since it seemed appropriate - the compiler
will remove all the polymorphism, but there should still be a good bit
of code in the resulting executable. Furthermore, the result doesn't
necessarily use, say, all of the Ur basis functions which are written
in C, so the unused ones could be removed by dead code elimination at
link time. The SQL tables were backed by sqlite.

The results were quite impressive, just from a size perspective:

a@pylon1:~/ur$ du -h ref.lto.exe
72K	ref.lto.exe
a@pylon1:~/ur$

Versus:

a@pylon1:~/ur$ du -h ref.exe
308K	ref.exe
a@pylon1:~/ur$

So the compiler was able to eliminate a *lot* of dead code! That's
pretty awesome. I haven't done any thorough speed tests with an HTTP
stress tester yet, although I probably should. Is there a good
benchmark for this?
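
To put a number on the size difference reported above, here is a
minimal sketch in plain shell arithmetic. The helper name pct_saved is
made up for illustration, and the inputs are the rounded figures that
du -h printed, so the result is only approximate.

```shell
#!/bin/sh
# Percentage by which an executable shrank, using integer arithmetic.
# Arguments: old size, new size (same unit, e.g. KB as printed by du).
# pct_saved is a hypothetical helper name, not part of any tool.
pct_saved() {
    echo $(( ($1 - $2) * 100 / $1 ))
}

# Sizes reported by du -h: 308K without LTO, 72K with LTO.
echo "LTO saved about $(pct_saved 308 72)% of the static executable"
```

With the two sizes above this works out to roughly three quarters of
the original executable.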

To replicate these results, a few changes are needed:

1) In src/c/Makefile.am, add the compiler flags '-flto
-fwhole-program'. -flto tells GCC to emit LTO info into the object
files, and -fwhole-program tells it to optimize each compilation unit
as if it were the whole program.
2) Generate new makefiles
3) When you configure the Ur/Web compiler, you need to do this:

$ GCCARGS="-flto -fwhole-program -fuse-linker-plugin" CFLAGS="-O3" \
    ./configure ...

The -flto and -fwhole-program flags are thus used on all executables
compiled by the Ur/Web compiler.
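
The three steps above can be sketched as a dry-run script that only
prints what it would do. This is a sketch under assumptions: the mail
only says "generate new makefiles", so 'autoreconf -fi' is my guess at
one way of doing that, and the run and lto_rebuild helpers are
hypothetical names. Add whatever other configure options you normally
pass.

```shell
#!/bin/sh
# Dry-run sketch of the LTO rebuild: prints commands, runs nothing.
LTO_FLAGS="-flto -fwhole-program"

# Hypothetical helper: echo the command instead of executing it.
# Replace the body with "$@" to actually run the commands.
run() {
    echo "+ $*"
}

lto_rebuild() {
    # Step 1 is manual: add $LTO_FLAGS to the flags in src/c/Makefile.am.
    # Step 2: regenerate the makefiles (assumption: autoreconf does this).
    run autoreconf -fi
    # Step 3: configure so that compiled applications also get the flags.
    run env GCCARGS="$LTO_FLAGS -fuse-linker-plugin" CFLAGS="-O3" ./configure
    run make
}

lto_rebuild
```

Keeping the flags in one variable makes it harder for the runtime and
the generated applications to drift apart.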

The result of all of this is that all of the C files for the runtime
and all executables compiled by the Ur/Web compiler are compiled with
LTO information. Note the use of -fuse-linker-plugin - this flag is
crucial, and it requires the gold linker ('aptitude install
binutils-gold'). It allows the linker to do LTO with object files
inside archives. Since the RTS is compiled into archives, a good bit
of the code lives there, and no real LTO can happen without it.
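
Since the setup depends on gold being installed, a quick check before
building may save some confusion. This is a minimal sketch: I am
assuming the package puts the linker on PATH under the name ld.gold,
and have_gold is a made-up helper name.

```shell
#!/bin/sh
# Report whether a gold linker binary is on PATH.
# Assumption: the binary is installed under the name ld.gold.
have_gold() {
    if command -v ld.gold >/dev/null 2>&1; then
        return 0
    else
        return 1
    fi
}

if have_gold; then
    echo "gold found: $(command -v ld.gold)"
else
    echo "gold not found; install it before trying the LTO build"
fi
```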

Also note the need for CFLAGS="-O3" - this is because by default,
autoconf will compile at -O2, and the Ur/Web compiler compiles at -O3.
However, unless I'm mistaken, GCC requires that for LTO to work,
object files must be compiled with the same optimization settings. So
this evens things out.
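
One way to catch a mismatch is to compare the -O flag in the two
command lines. The sketch below is only an illustration: opt_level is
a hypothetical helper, and the runtime command line is invented for
the example (the application line mirrors what the Ur/Web compiler
prints in debug mode).

```shell
#!/bin/sh
# Print the last -O<level> flag in a compiler command line.
# GCC uses the last -O option when several are given.
opt_level() {
    level="(none)"
    for arg in $1; do
        case "$arg" in
            -O*) level="$arg" ;;
        esac
    done
    echo "$level"
}

# Hypothetical runtime compile line, for illustration only.
runtime_cmd="gcc -flto -fwhole-program -O3 -c request.c"
# Application compile line in the style printed by the Ur/Web compiler.
app_cmd="gcc -flto -fwhole-program -Wimplicit -Werror -O3 -c /tmp/webapp.c"

if [ "$(opt_level "$runtime_cmd")" = "$(opt_level "$app_cmd")" ]; then
    echo "optimization levels match: $(opt_level "$app_cmd")"
else
    echo "mismatch: runtime $(opt_level "$runtime_cmd"), app $(opt_level "$app_cmd")"
fi
```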

A final note which is worth mentioning: I removed the Ur/Web
compiler's "-fno-inline" invocation to GCC - is there a reason for
this flag, like a bug? The resulting applications I've tried seem to
work, but I removed it because I was trying to expose as many
optimization opportunities as possible. I have not tested the LTO
changes with this flag in effect, so I don't know if it makes any
crucial difference (but this is worth checking.)

Anyway, with LTO, the final compile and link commands run by the Ur/Web compiler look like this:

gcc -flto=4 -fwhole-program -fuse-linker-plugin -Wimplicit -Werror -O3
-I /home/a/urweb/build/include/urweb  -c /tmp/webapp.c -o
/tmp/webapp.o
gcc -Werror -O3 -lm -pthread -flto=4 -fwhole-program
-fuse-linker-plugin /home/a/urweb/build/lib/urweb/../liburweb_http.a
/home/a/urweb/build/lib/urweb/../liburweb.a -L/usr/lib -lssl -lcrypto
/tmp/webapp.o -o /home/a/ur/ref.exe -lsqlite3

And without LTO, the compile and link commands are these:

gcc  -Wimplicit -Werror -O3 -fno-inline -I
/home/a/urweb/build/include/urweb  -c /tmp/webapp.c -o /tmp/webapp.o
-g
gcc -Werror -O3 -lm -pthread
/home/a/urweb/build/lib/urweb/../liburweb_http.a
/home/a/urweb/build/lib/urweb/../liburweb.a -L/usr/lib -lssl -lcrypto
/tmp/webapp.o -o /home/a/ur/ref.exe -lsqlite3 -g


The reduction in executable size is, IMO, quite good. I imagine most
of it comes from dead code elimination rather than raw optimization,
but I could be wrong. Either way, these changes bring the resulting
*static* executable within ~8 KB of the *dynamically* linked version.
I think that's insanely good from a dead code perspective:

a@pylon1:~/ur$ du -h ref.dyn.exe
64K	ref.dyn.exe
a@pylon1:~/ur$

Anyway this email is getting a bit long so I'll drop it here, but
hopefully these results seem interesting to you. Like I said, I don't
know if it *really* matters though - I don't think Ur/Web's speed is
going to be the limiting factor in web apps any time soon from what
I've seen. But hey, more speed and smaller executables are always good.

On Fri, Jun 17, 2011 at 3:34 PM, Adam Chlipala <adamc at impredicative.com> wrote:
> austin seipp wrote:
>>
>> I was also thinking about making it possible to compile the ur/web
>> runtime as an LLVM bitcode file, and do similar for programs you
>> compile with the compiler. Then  the final link step can merge the
>> files and effectively do whole program optimization, before emitting a
>> final executable. Alas, I don't know autoconf very well, nor do I know
>> automake (I'd need to make automake generate LLVM bitcode archives or
>> simply bitcode files instead of object file archives, which would
>> require some infrastructure I think.)
>>
>
> I'd be happy to add support for this, if you can tell me all of the
> appropriate changes that lie outside of Ur/Web's SML/C source code. :)
>
> A good first step would be to demonstrate the sequence of commands used to
> build the final application from all of the C sources that go into it.  To
> get the C source for an individual Ur/Web application, you can run 'urweb'
> with the '-debug' flag, in which case the C source is left in /tmp/webapp.c.
>  The C-compilation command lines Ur/Web is presently using will also be
> printed in that mode.
>
> _______________________________________________
> Ur mailing list
> Ur at impredicative.com
> http://www.impredicative.com/cgi-bin/mailman/listinfo/ur
>



-- 
Regards,
Austin
