+ Fixes to make:

 -Bdirect: cuml. dlopen times:
0.676526
0.695816
0.703797
 normal: culm dlopen times:
1.811801
1.830609
1.859336

kcachgrind: 10million L1 cache misses, 2.3million L2 misses
LD_DEBUG=stats - 64k named relocations
=160 L1 cache misses each, & 31 L2 misses each.

Afterwards:
    Linking 'only' ~15% of the time (vs. 50%)
	+ still the largest single thing though.


    + suse-dynsort: FIXME: need to free this data too ...


The dynsort harness: -zdynsort, -hashvals:
UnSorted
Iter 39.66 ms
Iter 39.60 ms
Iter 41.17 ms
Iter 39.49 ms
Sorted
Iter 14.18 ms
Iter 14.22 ms
Iter 14.12 ms
Iter 15.06 ms
     ie. a ~60% speedup [ no benefit from -Bdirect either ;-]

-Wl,-Bdirect -Wl,-z dynsort -Wl,-hashvals

** Generating a new '.suse.hashvals' section **
   DT_SUSE_HASHVALS 0x60000000d 0x6ffff000
   + Random no. in range:
	SUSE_LO 6cbdd030
	SUSE_HI 6cbdd031

VCL diffs:
    Iter 192.02 ms
    Iter 192.23 ms
    Iter 192.22 ms

    Iter 94.25 ms
    Iter 93.78 ms
    Iter 94.96 ms


   + Easy to output
	+ re-factoring glibc is a *total* PITA ...

	dl-reloc.c - calls 'RESOLVE_MAP'

(gdb) p map->l_info[0x4b]
$2 = (Elf32_Dyn *) 0x0
(gdb) p *map
$3 = {l_addr = 1073872896, l_name = 0x400009a8 "/usr/lib/libstdc++.so.6",	

** 


Improving binutils' symbol & reloc sorting efficiency:

// Improved binutils symbol sorting for cache efficiency ...
    + Unfortunately -Wl,-combreloc exports & re-imports
      the relocs before sorting them [ most odd ]

    + We sort .dynsym nicely (by elf hash)
	+ we fail to sort .dynstr
	    - in sequence of suffix order [!?]
	+ elf-strtab.c (_bfd_elf_strtab_finalize)
	    + does a sort - but only on a copy ...
		+ also assigns hard addresses [!]
	    + ultimately in bfd_strtab_add order ...
		+ can we sort earlier & influence that ?
	    + red herring:
	    	    + '_bfd_elf_export_symbol' ...
	    + _bfd_elf_merge_symbol (?)
	    + *Unfortunately*
		+ the sym table is already built & swapped out (?)

    + we want relocs sorted by absolute elf hash value
	+ Unfortunately - we don't have the symbol foo here
	+ which *really* sucks ...
	+ The relocs are typically just copied from the source
	  though so ... _bfd_elf_link_output_relocs ...
	    + the SymStrTab + contains all the symbols
	    + in the right order - given an index each
	    + unfortunately inaccessible by 'index' ...
    + Back the strhash with the actual string data ? [!?]
	+ thus avoiding foos ?

    + and we want symbols sorted by % hash_table_size
        + HACK: _bfd_elf_link_renumber_dynsyms [!?]
	    + sort as we renumber ?
	+ bfd_elf_final_link
	    + elf_link_hash_traverse ... elf_link_output_extsym
	    + [ elf_sort_symbol ? ]

    ** Could we have a flat array of entries ?
	+ we need Index [ it's a byte index ... ]
	+ urgh ...
		+ st_name is a byte offset into foo


    ** Problems:
	+ dynsym index order is: (bfd_elf_final_link):
	    + dummy_symbol [elf_link_output_sym ...]
	    + elf_link_hash_traverse (elf_link_output_extsym...)
		localsyms = true
	    + abfd->sections order
	    + local dynsyms ...
	    + elf_link_hash_traverse (elf_link_output_extsym)
		localsyms = false
    ** To Do:
	+ 'simply' - pre-quicksort on 
	    (elf_link_hash_entry):
	    ** h->u.elf_hash_value % hash_size [!?]
	+ iterate over that instead of the hash ...
	+ also sorts the strings by elf hash incidentally :-)

	+ bucketcount = elf_hash_table (finfo->info)->bucketcount;
	+ bucket = h->u.elf_hash_value % bucketcount;

	    + compute_bucket_count ...
		+ builds 'hashcodes' list ...
		+ used only locally ...

    ** What about _bfd_elf_link_renumber_dynsyms ...
	+ what about hash table removals
	    - surely they change the order of traversal ?

** Hack elf_link_hash_traverse
	+ in 2 modes: pre-sort & post sort.

** We want to sort before _bfd_elf_link_renumber_dynsyms:
	from bfd_elf_size_dynamic_sections ...
	    + bucketcount calculated on line 5787
		+ bfd_elf_size_dynamic_sections [4987->5816]
	    + renumber_dynsyms is line 5715
		+ [urk]

dltest.c - simple dlopen 100 times of libvcl:
Iter 0.12241 us
Iter 0.12255 us
Iter 0.12828 us

sorted alphabetically:
Iter 0.12222 us
Iter 0.12506 us
Iter 0.12137 us

just libvcl sorted by elf_hash % bucketcount:
Iter 0.11994 us
Iter 0.11950 us
Iter 0.11959 us

more libvcl ldd output re-linked:
     sal, libutl, libsot

unchanged:
Iter 0.36873 us
Iter 0.36727 us
Iter 0.3777 us
Iter 0.39592 us
Iter 0.41514 us -- outlier
Iter 0.38233 us

Avg: 0.3784 ( + 0.3845 with outlier)

changed:
Iter 0.35362 us
Iter 0.35221 us
Iter 0.3547 us
Iter 0.37611 us
Iter 0.38871 us
Iter 0.37706 us

Avg: 0.3671

** sorting relocs by elf hash
   + store the hash value as we write it out ?!
   + store an array of 'h' pointers per reloc ?
...
    + then we have an array of foo ...
	+ or ? ... symbol foo...
    + OR - store the symbol strings in a chunk
      instead & index by offsets ?
	+ can realloc take that ?
	+ can we work out the size in advance ?
	+ can we get a very-slow prototype ?
    + foo:
	+ just read-in the elf string tab anyway.
	+ it got emitted by bfd_stringtab_emit just above anyway.

    + relocations - point into dynsym ? - if so - horay ;-)
	+ we can use the previous sort / foo table too ?
	+ otherwise we screwed that up nastily [?]
	    + hmm.

    + relocs must have a dynsym ptr - since that's
      what we accelerate -Bdirect with ...

** TODO
	+ Check that we do elide partially matching symbols ...
	+ run 'make check' (?)
	    + 3 unexpected failures
		+ all related to having re-sorted syms / relocs

+ Issues:
	+ qsort 'global' closure ?
	    + Pre % the elf symbol hash data with the
	      bucket count ! :-)
	+ add an elf_link_hash_entry pointer and pass
	  that (if we can) from bfd_elf_link_record_dynamic_symbol ?
	+ then we can avoid re-calculating the symbol hash etc.
	  also get the foo to baa more nicely ... (?)


** Way better:
    + 30% faster ...
	[ for my 1 test ;-]

** For just libsvx: - each avg of 10 runs:

Unsorted:
Iter 549 ms
Iter 552 ms
Iter 550 ms
     Avg: 550ms

Sorted:
Iter 536 ms
Iter 540 ms
Iter 540 ms
     Avg: 539ms


** For libvcl: avg of 50 runs
    + with 5 of it's deps re-linked:

Unsorted:
Iter 88.45 ms
Iter 88.27 ms
Iter 89.89 ms
Iter 88.65 ms
     + 88.82ms

Sorted:
Iter 85.11 ms
Iter 85.22 ms
Iter 85.43 ms
Iter 84.57 ms
     + 85.08ms


** Analysis of the test harness:

L2 data read misses:

do_lookup_x    362k -> 130k
strcmp	       159k -> 53k
_dl_elf_hash   158k -> 147k

L1 data read misses:

_dl_relocate_obj    1649k -> 587k
do_lookup_x	    923k -> 295k
_dl_elf_hash	    187k -> 162k
strcmp		    216k -> 80k

michael@linux:/opt/OOInstall/program> for a in 0 1 2 3 4 5 6 7 8 9; do /opt/gcc/src/dynsort-test/dltest 10 ./libsvx680li.so; done
Iter 982.57 ms
Iter 1028.00 ms
Iter 1029.86 ms
Iter 1019.55 ms
Iter 1047.68 ms
Iter 1031.04 ms
Iter 1051.30 ms
Iter 1029.33 ms
Iter 1016.63 ms
Iter 615.35 ms
michael@linux:/opt/OOInstall.sorted/program> for a in 0 1 2 3 4 5 6 7 8 9; do /opt/gcc/src/dynsort-test/dltest 10 ./libsvx680li.so; done
Iter 938.20 ms
Iter 962.02 ms
Iter 941.60 ms
Iter 944.48 ms
Iter 931.99 ms
Iter 942.54 ms
Iter 953.35 ms
Iter 939.51 ms
Iter 962.91 ms
Iter 949.63 ms


Not having 'UNDEF' slots in the chain:
no slots:
Iter 3753.08 ms
Iter 3602.53 ms
Iter 3654.38 ms
Iter 3575.02 ms
Iter 3661.06 ms
Iter 3659.29 ms
slots:
Iter 3526.86 ms
Iter 3634.79 ms
Iter 3666.99 ms
Iter 3595.30 ms
Iter 3685.82 ms
Iter 3681.00 ms

michael@linux:/opt/OOInstall/program> objdump -T libsvx680li.so | grep '\*UND\*' | wc -l
5219
michael@linux:/opt/OOInstall/program> objdump -T libsvx680li.so | grep -v '\*UND\*' | wc -l
48289
