[[title
         Considering String Allocation and Copying in Tla 1.x
]]

* Is String Copying in `libawk' a Problem Worth Fixing?

  [[blockquote

    /Hypothesis:/

    *Excessive string copying is a significant (attention deserving)
    performance problem in `tla', and most of the problem is caused by
    the behavior of `libawk'.*

  ]]

* Background

  `libawk' in `tla' provides two data structures used throughout most
  of `tla': a /relational table/ (a *2-d, dynamically sized, array of
  strings*) and an /associative table/ (a *mutable associative mapping
  of strings to strings*).  Some basic relational operators are
  provided for the tables.

  These data structures are represented quite simply.  The relational
  tables are dynamically allocated arrays of pointers (each element a
  table row), pointing to dynamically allocated arrays of pointers
  (each element a column within one row), pointing to dynamically
  allocated 0-terminated strings of unsigned `char' values.  The
  associative tables are simple hash tables, with keys and values
  represented by privately allocated, 0-terminated strings of unsigned
  `char' values.

  A noteworthy property of the `libawk' implementation is that
  operations which create a new table (relational or associative) must
  newly allocate copies of all of the strings in that table,
  regardless of where the original data comes from.

  For example: a `join' operation accepts two tables as input and
  produces as output a third table.  All strings in the output table
  have identical contents to some string in one or both of the input
  tables, but `join' as currently implemented allocates a new copy of
  those strings anyway.

  Another example: suppose that I want two copies of a single table,
  each sorted using a different column as the sort key (as is common
  in some `tla' operations).  Each string in either table is
  duplicated in the other, the duplicate being a separately allocated
  copy.


* Envelope Calculations and Anecdotal Profiling/Tuning Results

  To put some rough numbers to this, I examined an arch import
  of the linux kernel prepared by John Goerzen.   I computed
  an inventory:

  [[tty
      % tla inventory --source --both --ids > ,foo
  ]]

  which produces a list of all source files which are not arch control
  files, listing the relative path to each file along with that files
  inventory id, one such pair per line.  This listing disregards all
  arch control files on the grounds that, in theory, arch might be
  modified to shrink the list of such files by a large amount.

  The size of the resulting listing is a reasonable approximation of
  the amount of memory that must be dynamically allocated to create an
  in-core table containing the inventory.  It is also approximately
  the amount of memory needed to allocate the keys and values of an
  associative table mapping file paths to ids or vice versa.

  The number of lines in that listing is basically half the number of
  calls to `malloc' needed to create such a table and half the number
  of calls to `free' needed to free such a table.

  Some performance critical operations in arch are likely to create
  (approximately but also "at least") two (distinctively sorted)
  tables of the `(file-path, inventory-id)' relationship and two
  associative tables mapping file paths to ids and ids to inventory
  ids.

  The listing size for this particular copy of the kernel has a bit
  over 30,00 lines and is just a bit over 1.5MB long.

  Four tables (two associative, two relational) will require

  [[tty

                n_allocations == 4 * (30000 * 2)
                              == 240000


                n_string_bytes == 4 * 1.5MB
                               == 6MB
  ]]

  All four tables can be constructed from a single one, returned by
  the `inventory' primitivie.  That construction, in addition to
  allocating new storage for strings, will copy the contents of each
  string three times: 4.5MB of string copying spread across 180000
  separate strings.

  These expenses are likely to be small relative to the costs of
  system calls made by arch and, especially on small trees, are
  likely to be insignificant.   Alas, there is a limit to how much
  arch's use of system calls can be optimized and my sense is that
  we are nearing that limit.

  In absolute terms (such as "wall clock time") these and related
  string and table expenses are quite noticable.  For example, several
  arch developers have obtained noteworthy performance improvements
  simply by replacing the unoptimized string copying operations with
  calls to the highly optimized versions typically found in the native
  `libc' of each host system.

  In my own experience, substitution of a silghtly slower `malloc'
  for a native malloc also has a tangible performance impact and
  quick-n-dirty profiling with `gprof' suggests that the vast majority
  of these calls originate in `libawk'.

  These calculations and experience with profiling and optimizing 
  suggest that attention to `libawk', especially with the aim of
  eliminating string allocations and copies, is likely to be rewarded
  with a somewhat faster, significantly more memory-efficient 
  `tla'.


* Secondary Reasons to Want to Change String Handling

  The performance incentives for changing string handling are 
  not the only reason to reconsider string handling in `libawk'.

  I have three other reasons (at least):

** The `libawk' Abstractions Do Not Require String Copying

  `libawk' was originally intended to provide mutable tables
  of immutable strings.   In other words, clients of the library
  should be free to add or delete rows or columns, keys and values --
  but they should not ever directly modify a string which is an
  element of a table.

  As currently presented, the `libawk' API does not enforce that
  aspect of its abstraction.  To the degree it is successfully
  honored by the rest of the code in `tla' (which is not entirely
  clear), the abstraction is honored only by careful coding with
  no particular assistence from the /C/ type system.

  Modifying the `libawk' API to enforce the immutability of 
  strings would make client code of that library easier to 
  understand, maintain, verify, and modify.

** The Representation Type of `libawk' Strings Will Probably Change Anyway

  It is conceivable that, as `tla' acquires support for
  extended character sets in filenames and other strings,
  these will be simply be represented internally as 
  Unicode strings, using the UTF-8 encoding form.

  It is also conceivable (and in my view more sensible) if 
  strings might internally be represented as Unicode strings
  using a different encoding form (such as UTF-16 or UTF-32,
  depending on circumstance).

  Regardless, either way, the shift from presumed-ASCII strings
  to strings which sometimes include extended characters is 
  likely to be a difficult one to implement.   (Indeed, one
  justification for beginning `tla 2.0' is to avoid having to make
  this shift on the `tla 1.x' code base by instead providing a
  cleaner-from-the-start foundation in `tla 2.0').

  If we want to potentiate making the shift to extended characters in
  `tla 1.x', modifying the `libawk' API so that internal string
  representations are better hidden is a definate help.

  Currently, much code in `tla 1.x' assumes that, for example, 
  the string in the third column of the second row of a table
  can be accessed as a value of type `t_uchar *' using a 
  construct such as:

  [[tty
        t_uchar * table_elt = table[1][2];
  ]]

  Eliminating such constructs by hiding them behind a procedural
  abstraction (i.e., not having clients use array subscripting but,
  instead, having them call a function to access elements)
  would be progress on both eliminating needless string copying
  /and/ preparing the way for extended character sets.

** Hash-consed and/or Constructive Strings May be Desirable

  [This is, by far, the most speculative rationale for 
  changing `libawk'.]

  String comparison is an important operation in `tla', both 
  comparison for equality and comparison for lexical order.

  Unicode, necessarily, makes comparison operations considerably
  more expensive than their ASCII-only analogs.

  One *sometimes applicable* technique for reducing the expense of
  such operations in a string-intensive application is to (a) use only
  immutable strings; (b) never allocate two strings with identical
  contents (always share a single string) -- this gives rise to a fast
  equality test; (c) memoize the results of all string comparisons for
  order, doing so in a manner that allows a quick inference of the
  order between two previously uncompared strings provided that they
  have both been compared to some other string or to two other strings
  whose order is known (or less expensively inferrable).

  Another *sometimes applicable* technique for reducing string copying
  and, in particular, speeding up operations such as substring 
  extraction and string concatenation is to use what might be called
  "constructive strings" (sometimes called "ropes" after an
  implementation by Hans Boehm).

  Using reference-counted immutable strings in `libawk' opens the 
  door to trying out either of these techniques.


* Program for Research

  I am interested in finding an easy way to eliminate `libawk'-related
  string copying.

  There are many ways a `libawk'-style API might work without so 
  much string copying but I think that the most straightforward
  of these would be to:

  [[blockquote
     /1. Reference-count `libawk' Strings/  Strings do not 
     contain references to other objects and thus are not
     subject to being part of data structure containing a 
     cycle of references.   For that reason, reference counting
     is a natural fit for memory managing them.

     /2. Make the `libawk' Types Opaque/  The internal 
     representations in `libawk' should be hidden from 
     clients.
  ]]

  My intention, then, isf to proceed first by making such changes to
  `libawk'.   This will immediately break a good deal of code 
  throughout `tla'.   The breakage will be of a variety that a 
  /C/ compiler will report it -- this is convenient because the 
  compiler will help to trace down all code which must be modified
  to honor the corrected `libawk' abstractions.


* Current Questions

  Have other Arch developers tried a similar experiment already?

  Do other Arch developers agree with the conclusions I draw from
  my envelope calculations and the anecdotal profiling and tuning
  results?

  What is the comfort-level among Arch developers regarding the
  prospects of migrating such a change, which is likely to be
  very large in scope, from a `tla-1.3.1' branch to other branches
  such as `baz1.x' and `tla-1.4'?



* Copyright

 /Copyright (C) 2004 Tom Lord/

 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation; either version 2, or (at your option)
 any later version.

 This program is distributed in the hope that it will be useful,
 but /WITHOUT ANY WARRANTY/; without even the implied warranty of
 /MERCHANTABILITY/ or /FITNESS FOR A PARTICULAR PURPOSE/.  See the
 GNU General Public License for more details.

 You should have received a copy of the GNU General Public License
 along with this program; if not, write to the Free Software Foundation,
 Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

 See the file `COPYING' for further information about
 the copyright and warranty status of this work.



[[null
   ; arch-tag: Tom Lord Wed Feb  9 11:42:00 2005 (gnu-arch/string-copying-notes.txt)
]]

