"D"iving Into the D Programming Language

betty_betty2008

浏览: 23947 次
性别:
来自: 东莞

最近访客更多访客>>

z998998

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

博客分类：

C C++C#Python LISP

"D"iving Into the D Programming Language
By Andrei Alexandrescu

Date: Jul 29, 2009

Book Excerpt is provided courtesy of Addison-Wesley Professional.

Return to the article

--------------------------------------------------------------------------------
Andrei Alexandrescu dives into explaining the basics of the D programming language.
--------------------------------------------------------------------------------

You know what's coming first, so without further ado:

import std.stdio;
void main() {
    writeln("Hello, world!");
}Depending on other languages you know, you might have a feeling of déjà vu, a mild satisfaction that this little task doesn't already acquaint you with one third of the language's keywords, or perhaps a slight disappointment that D didn't go the scripting languages' route of allowing top-level statements. (Top-level statements invite global variables, which quickly turn into a liability as the program grows; D does offer ways of executing code outside main, just in a more structured manner.) If you're a stickler for precision, you'll be relieved to hear that void main is equivalent to an int main that returns “success” (code zero) to the operating system if it successfully finishes execution.

But let's not get ahead of ourselves. The purpose of the traditional “Hello, world” program is not to discuss a language's capabilities, but instead to get you started on writing and running programs using that language. So, after you typed the code above in a file called, say, hello.d, fire a shell and type the following commands:

$ dmd hello.d
$ ./hello
Hello, world!
$ _where '$' stands in for your favorite command prompt. If your operating system supports the so-called "shebang notation" for launching scripts, you'll be glad to hear that simply adding the line:

#!/usr/bin/rdmdto the very beginning of your hello.d program makes it directly executable. After you operate that change, you can simply type at the command prompt:

$ chmod u+x hello.d
$ ./hello.d
Hello, world!
$ _(You only need to do the chmod thing once.) The rdmd program (part of the D distribution) is smart enough to cache the generated executable, such that compilation is actually done only after you've changed the program, not every time you run it. From here on, the short programs featured in this book won't specify this detail, leaving it up to you to use the shebang feature '#!' if you so prefer.

The program itself starts with the directive:

import std.stdio;which instructs the compiler to look for a module called std.stdio and make its symbols available for use. import is akin to the #include preprocessor directive found in C and C++, but is closer in semantics to Python's import: there is no textual inclusion taking place—just a symbol table acquisition. Repeated imports of the same file are of no import.

Per the venerable tradition established by C, a D program consists of a collection of declarations spread across multiple files. The declarations can introduce, among other things, types, functions, and data. Our first program defines the main function to take no arguments and return "nothingness"—void, that is. When invoked, main calls the writeln function (which, of course, was cunningly defined by the std.stdio module), passing it a constant string. The "ln" suffix indicates that writeln appends a newline to the printed text.

The following sections provide a quick drive through Deeville. Little illustrative programs introduce basic language concepts. At this point the emphasis is on conveying a feel for the language, rather than giving pedantic definitions. If your eyes start glazing over some details, feel free to skip the tricky sections for now. Later chapters will treat each part of the language in greater detail, after which it's worth giving this chapter a more thorough pass.

1.1 Numbers and Expressions
Ever curious how tall foreigners are? Let's write a simple program that displays a range of usual heights in feet+inches and in centimeters.

/*
Compute heights in centimeters for a range of heights
expressed in feet and inches
*/
import std.stdio;

void main() {
// values unlikely to change soon
immutable inchPerFoot = 12;
immutable cmPerInch = 2.54;

// loop'n write
foreach (feet; 5 .. 7) {
    foreach (inches; 0 .. inchPerFoot) {
      writeln(feet, "'", inches, "''\t",
        (feet * inchPerFoot + inches) * cmPerInch);
    }
}
}When executed, this program will print a nice two-column list:

5'0''   152.4
5'1''   154.94
5'2''   157.48
...
6'10'' 208.28
6'11'' 210.82Just like Java, C++, and C#, D supports /*multi-line comments*/ and //single-line comments (plus a couple others, which we'll get to later). One more interesting detail is the way our little program introduces its data. First, there are two constants:

immutable inchperfoot = 12;
immutable cmperinch = 2.54;Constants that will never, ever change are introduced with the keyword immutable. Constants, as well as variables, don't need to have a manifest type; the actual type can be inferred from the value with which the symbol is initialized. In this case, the literal 12 tells the compiler that inchperfoot is a 32-bit integer (denoted in D with the familiar int); similarly, the literal 2.54 causes cmperinch to be a 64-bit floating-point constant (of type double). Going forth, we notice that the definitions of feet and inches avail themselves of the same magic, because they look like variables all right, yet have no explicit type adornments. That doesn't make the program any less safe than one that stated:

immutable int inchperfoot = 12;
immutable double cmperinch = 2.54;
...
foreach (int feet; 5 .. 7) {
   ...
}and so on, only less redundant. The compiler allows omitting type declarations only when types can be unambiguously inferred from context. But now that types came up, let's pause for a minute and see what numeric types are available.

In order of increasing size, the signed integral types include byte, short, int, and long, having sizes of exactly 8, 16, 32, and 64 bits respectively. Each of these types has an unsigned counterpart of the same size, named following a simple rule: ubyte, ushort, uint, and ulong. (There is no "unsigned" modifier as in C.) Floating point types consist of float (32 bits IEEE-754 single-precision number), double (64 bits IEEE-754), and real (which is as large as the machine's floating-point registers can go, but no less than 64 bits; for example, on Intel machines real is a so-called IEEE 754 double-extended 79-bit format). Automatic conversions among these types exist, with the restriction that an integral can never be automatically converted to one of fewer bits. However, floating-point numbers do allow automatic conversion to narrower floating-point numbers, decision that may seem surprising to the unwary. The explanation of this choice is a bit involved; the short version is that implicit narrowing conversions allow the compiler to manipulate intermediate calculation results at optimal size and precision, which ultimately means faster and more accurate floating-point calculations. The rule of thumb is that the compiler guarantees precision at least as good as required by the type specified in user code (i.e., float, double, or real). By giving the compiler the freedom to internally operate at higher precision, the pernicious effects of cumulative errors and intermediate overflows are greatly reduced.

Getting back to the sane realm of integral numbers, literals such as 42 can be assigned to any numeric type, with the mention that the compiler checks whether the target type is actually large enough to accommodate that value. So the declaration:

immutable byte inchperfoot = 12;is as good as the one omitting byte because 12 fits as comfortably in 8 bits as in 32. By default, if the target type is to be deduced from the number (as in the sample program), integral constants have type int and floating-point constants have type double.

Using these types, you can build a lot of expressions in D using arithmetic operators and functions. The operators and their precedence are much like the ones you'd find in D's sibling languages: '+', '-', '*', '/', and '%' for basic arithmetic, '==', '!=', '<', '>', '<=', '>=' for comparisons, fun(argument1, argument2) for function calls, and so on.

Getting back to our centimeters-to-inches program, there are two noteworthy details about the call to writeln. One is, it's invoked with five arguments (as opposed to one in the program that opened the hailing frequencies). Much like the I/O facilities found in Pascal (writeln), C (printf), or C++ (cout), D's writeln function accepts a variable number of arguments (it is a "variadic function"). In D, however, users can define their own variadic functions (unlike in Pascal) that are always typesafe (unlike in C) without needing to gratuitously hijack operators (unlike in C++). The other detail is that our call to writeln awkwardly mixes formatting information with the data being formatted. Separating data from presentation is often desirable, so let's use the formatted write function writefln instead:

writefln("%s'%s''\t%s", feet, inches,
(feet * inchperfoot + inches) * cmperinch);The newly-arranged call produces exactly the same output, with the difference that writefln's first argument describes the format entirely. '%' introduces a format specifier similar to C's printf, for example '%i' for integers, '%f' for floating point numbers, and '%s' for strings.

If you've used printf, you'd feel right at home, were it not for an odd detail: we're printing ints and doubles here—how come they are both described with the '%s' specifier, which traditionally describes only strings? The answer is simple. D's variadic argument facility gives writefln access to the actual argument types passed, setup that has two nice consequences: (1) the meaning of '%s' could be expanded to "whatever the argument's default string representation is," and (2) if you don't match the format specifier with the actual argument types, you get a clean-cut error instead of the weird behavior specific to misformatted printf calls (to say nothing about the security exploits made possible by printf calls with untrusted format strings).

1.2 Statements
In D, just like in its sibling languages, any expression followed by a semicolon is a statement (for example, the sample program's call to writefln has a ';' appended). The effect of the statement is to simply evaluate the expression.

D is a member of the "curly-braces block scoped" family, meaning that you can group several statements into one by surrounding them with '{' and '}'—something necessary, for example, when you want to do several things inside a foreach loop. In the case of exactly one statement, you can omit the curly braces entirely. In fact, our entire height conversion double loop could be rewritten as follows:

foreach (feet; 5 .. 7)
foreach (inches; 0 .. inchperfoot)
    writefln("%s'%s''\t%s", feet, inches,
    (feet * inchperfoot + inches) * cmperinch);Omitting braces for single statements has the advantage of shorter code and the disadvantage of making edits more fiddly (during code maintenance, you'll need to add or remove braces as you mess with statements.) People tend to get pretty divided when it comes about rules for indentation and for placing curly braces. In fact, so long as you're consistent, these things are not as important as they might seem, and as a proof, the style used in this book (full bracing even for single statements, opening braces on the introducing line, and closing braces on their own lines) is, for typographical reasons, quite different from the author's style in everyday code. If he could do this without turning into a werewolf beyond all reasonable doubt, so could anyone.

The Python language made popular a different style of expressing block structure by means of indentation—"form follows structure" at its best. Whitespace that matters is an odd proposition for programmers of some other languages, but Python programmers swear by it. D normally ignores whitespace, but is especially designed to be easily parsed (e.g. parsing does not need to understand the meaning of symbols), which suggests that a nice pet project could implement a simple preprocessor allowing usage of Python indentation style with D without suffering any inconvenience in the process of compiling, running, and debugging programs.

Finally, the code samples above already introduced the if statement. The general form should be very familiar:

if (expression) statement1 else statement2
A nice theoretical result known as the theorem of structure [1] proves that we can implement any algorithm using compound statements, if tests, and loops à la for and foreach. Of course any realistic language would offer more than just that, and D is no exception, but for now let's declare ourselves content as far as statements go, and move on.

1.3 Function Basics
Let's go beyond the required definition of the main function and see how to define functions in D. Function definitions follow the model found in other Algol-like languages: first comes the return type, then the function's name, and finally the formal parameters1 as a parenthesized comma-separated list. For example, to define a function called fun that takes an uint and a double and returns an int, you'd write:

int fun(uint x, double y) {
...
}Each function parameter (x and y in the example above) has, in addition to its type, an optional storage class that decides the way arguments are passed the function when invoked. By default, arguments are passed into fun by value. If storage class ref is prepended to parameter's type, the parameter is bound directly to the incoming argument such that changing the parameter is directly and immediately reflected in the value fun received from the outside. For example:

void fun(ref uint x, double y) {
   x = 42;
   y = 3.14;
}
void main() {
   uint a = 1;
   double b = 2;
   fun(a, b);
   writeln(a, " ", b);
}This program prints 42 2 because x is a ref uint, meaning that assigning to x really means assigning to a. On the other hand, assigning to y has no effect on b because y is a private copy at fun's disposal.

The last adornments we'll discuss in this brief introduction are in and out. Simply put, in is a promise on the part of the function that it only wants to look, not touch, the parameter. Finally, using out with a function parameter works similarly to ref, with the amendment that the parameter is forcibly initialized to its default value upon function's entry. (Each type T defines an initial value, denoted as T.init. User-defined types can define their own init.)

There is a lot more to say about functions. You can pass functions to other functions; nest them into one another; allow a function to save its local environment (full-fledged syntactic closures); create and comfortably manipulate unnamed functions (lambdas); and some additional juicy little bits. We will get to each of these in good order.

1.4 Arrays and Associative Arrays
Arrays and associative arrays (the latter colloquially referred to as hashtables or hashes) are arguably the most used compound data structures in the history of computing, enviously followed by Lisp's lists. A lot of useful programs need no more than some sort of array and associative array, so it's about time to see how D implements them.

Building a Vocabulary
For example, let's write a simple program following the specification:

Read a text consisting of words separated by whitespace, and associate a unique number with each distinct word. Output lines of the form ID word.
This little script can be quite useful if you want to do some text processing; once you have built a vocabulary, you only need to manipulate numbers (cheaper), not full-fledged words. A possible approach to building such a vocabulary is to accumulate already-seen words in a sort of a dictionary that maps words to integers. When adding a new mapping we only need to make sure the integer is unique (a solid option is to just use the current length of the dictionary, resulting in the IDs 0, 1, 2. . . ). Let's see how we can do that in D.

import std.stdio, std.string;

void main() {
   uint[string] dic;
   foreach (line; stdin.byLine) {
      // Break sentence into words
      string[] words = split(strip(line));
      // Add each word in the sentence to the vocabulary
      foreach (word; words) {
         if (word in dic) continue; // nothing to do
         uint newID = dic.length;
         dic[word] = newID;
         writeln(newID, '\t', word);
      }
   }
}In D, the type of an associative array (a hashtable) that maps values of type K to values of type V is denoted as V[K]. So the variable dic of type uint[string] maps strings to unsigned integers—just what we needed to store word-to-ID mappings. The expression word in dic is true if the key word could be found in associative array dic. Finally, insertion in the dictionary is done with dic[word] = newID.

The variable words is an example of an array in action. Dynamically-sized arrays of T are denoted as T[] and are allocated in a number of ways, such as:

int[] a = new int[20]; // 20 zero-initialized integers
int[] b = [ 1, 2, 3 ]; // an array containing 1, 2, and 3
Our little utility doesn't need to allocate words because split takes care of that. Unlike C arrays, D arrays know their own length, accessible as arr.length for any array arr. Assigning to arr.length reallocates the array. Array accesses are bounds checked; code that enjoys risking buffer overruns can scare the pointer out of the array (by using arr.ptr) and then use unchecked pointer arithmetic. Also, a compiler option disables bounds checking if you really need everything that silicon wafer could give. This places the path of least resistance on the right side of safety: code is safe by default and can be made a tad faster with more work. Besides, many D idioms naturally obviate the need for bounds checking. For example, here's how to iterate over an array using a new form of the already-familiar foreach statement:

int[] arr = new int[20];
foreach (elem; arr) {
   /* ... use elem ... */
}The loop above binds elem to each element of arr in turn, and there's never a need to check bounds (even if you do reallocate arr inside the loop, for a subtle reason that we'll discuss in Chapter ??). Assigning to elem does not assign back to elements in arr. To change the array, just use the ref keyword:

// Zero all elements of arr
foreach (ref elem; arr) {
   elem = 0;
}And now that we got to how foreach works with arrays, let's mention one more useful thing. If you also need the index of the array element while you're iterating, foreach can do that for you:

int[] months = new int[12];
foreach (i, ref e; months) {
   e = i + 1;
}The code above creates an array containing 1, 2,. . . , 12. The loop is equivalent to the slightly more verbose code below:

foreach (i; 0 .. months.length) {
   months[i] = i + 1;
}D also offers statically-sized arrays denoted as e.g. T[5]. Outside a few specialized applications, dynamically-sized arrays are to be preferred because more often than not you don't know the size of the array in advance.

Arrays have shallow copy semantics, meaning that copying an array variable to another does not copy the entire array, just spawns a new view to the same underlying storage. If you do want to obtain a copy, just use the dup property of the array:

int[] a = new int[100];
int[] b = a;
++b[10];    // b[10] is now 1, as is a[10]
b = a.dup; // copy a entirely into b
++b[10];    // b[10] is now 2, a[10] stays 1Array Slicing. Type-Generic Functions. Unit Tests
Array slicing is a powerful feature that allows referring to a portion of an array without actually creating a new one. To exemplify, let's write a function binarySearch implementing the eponymous algorithm: given a sorted array and a value, binarySearch quickly returns a Boolean value telling whether the value is in the array. D's standard library offers a function doing that in a more general way and returning something more informative than just a Boolean, but that needs to wait for more language features. Let us, however, bump our ambitions up just a notch, by setting to write a binarySearch that works not only with arrays of integers, but with arrays of any type as long as values of that type can be compared with '<'. It turns out that that's not much of a stretch. Here's what a generic binarySearch looks like:

import std.array;

bool binarySearch(T)(T[] input, T value) {
   while (!input.empty) {
      auto i = input.length / 2;
      auto mid = input[i];
      if (mid > value) input = input[0 .. i];
      else if (mid < value) input = input[i + 1 .. $];
      else return true;
   }
   return false;
}

unittest {
   assert(binarySearch([ 1, 3, 6, 7, 9, 15 ], 6));
   assert(!binarySearch([ 1, 3, 6, 7, 9, 15 ], 5));
}The (T) notation in binarySearch's signature introduces a type parameter T. The type parameter can then be used in the regular parameter list of the function. When called, binarySearch will deduce T from the actual arguments received. If you want to explicitly specify T (for example for double-checking purposes), you may write:

assert(binarySearch!(int)([ 1, 3, 6, 7, 9, 15 ], 6));which reveals that a generic function can be invoked with two pairs of parenthesized arguments. First come the compile-time arguments enclosed in !(...), and then come the run-time arguments enclosed in (...). Either or both sets of parentheses may be missing. Mixing the two realms together has been considered, but experimentation has shown that such a uniformization creates more trouble than it eliminates.

If you are familiar with similar facilities in Java, C#, or C++, you certainly noticed that D made a definite departure from these languages' use of angle brackets '<' and '>' to specify compile-time arguments. This was a deliberate decision aimed at avoiding the crippling costs revealed by experience with C++, such as increased parsing difficulties, a hecatomb of special rules and arbitrary tie-breakers, and obscure syntax to effect user-directed disambiguation.2 The difficulty stems from the fact that '<' and '>' are at their heart comparison operators,3 which makes it very ambiguous to use them as delimiters when expressions are allowed inside those delimiters. Such wannabe delimiters simply don't pair. Java and C# have an easier time exactly because they do not allow expressions inside '<' and '>', but that limits their future extensibility for the sake of a doubtful benefit. D does allow expressions as compile-time arguments, and chose to simplify the life of both human and compiler by extending the traditional unary operator '!' to binary uses and using the classic parentheses (which (I'm sure) you always pair properly).

Another detail of interest in binarySearch's implementation is the use of auto to leverage type deduction: i and mid have their types deduced from their initialization expressions.

In keeping with good programming practices, binarySearch is accompanied by a unit test. Unit tests are introduced as blocks prefixed with the unittest keyword (a file can contain as many unittests as needed, and you know how it's like—too many are almost enough). To run unit test before main is entered, pass the the -unittest flag to the compiler. Although unittest looks like a small feature, it is very handy for observing good programming techniques by making it so easy to insert small tests, it's embarrassing not to. Also, if you're a top-level thinker who prefers to see the unittest first and the implementation second, feel free to move unittest before binarySearch; in D, the semantics of a module-level symbol never depends on its relative ordering with others.

The slice expression input[a .. b] returns a slice of input from index a up to, and excluding, index b. If a == b, an empty slice is produced (and if a > b, an exception is thrown). A slice does not trigger a dynamic memory allocation; it's just an alias for a part of the array. Inside an index expression or a slice expression, $ stands in for the length of the array being accessed; for example, input[0 .. $] is exactly the same thing as input.

Again, although it might seem that binarySearch does a lot of array shuffling, no array is newly allocated; all of input's slices share space with the original input. The implementation is in no way less efficient than a traditional one maintaining indices, but is arguably easier to understand because it manipulates less state.

Counting Frequencies. Lambda Functions
Let's set out for another useful program: counting distinct words in a text. Want to know what were the most frequently used words in Hamlet? You're in the right place.

This program deals in associative array mapping strings to uints, and has a structure similar to the vocabulary building example. Adding a simple printing loop completes a useful frequency counting program:

import std.stdio, std.string;

void main() {
// compute counts
uint[string] freqs;
foreach (line; stdin.byLine) {
    foreach (word; splitter(strip(line))) {
      ++freqs[word];
    }
}
// print counts
foreach (key, value; freqs) {
    writefln("%6u\t%s", value, key);
}
}All right, now after downloading hamlet.txt off the Net, which is to be found (at the time of this writing) at http://www.cs.uni.edu/~schafer/courses/051/assign/labs/lab12/hamlet.txt, running our little program against The Bard's chef d'oeuvre prints:

1 outface
1 come?
1 blanket,
1 operant
1 reckon
2 liest
1 Unhand
1 dear,
1 parley.
1 share.
...which sadly reveals that output doesn't come quite ordered, and that whichever words come first are not quite the most frequent. This isn't surprising; in order to implements their primitives as fast as possible, associative arrays are allowed to store them internally in any order.

In order to sort output with the most frequent words first, you can just pipe the program's output to sort -nr (sort numerically and reversed), but that's in a way cheating. To integrate sorting into the program, let's replace the last loop with the following code:

// print counts
string[] words = array(freqs.keys);
sort!((a, b) { return freqs[a] > freqs[b]; })(words);
foreach (word; words) {
writefln("%6u\t%s", freqs[word], word);
}The function array takes the keys of the freqs associative arrays and puts them in array format, yielding an array of strings. The array is newly allocated, which is necessary because we need to shuffle the strings. We now get to the code:

sort!((a, b) { return freqs[a] > freqs[b]; })(words);which features the pattern we've already seen:

sort!(compile_time_arguments)(run_time_arguments);Peeling off one layer of parentheses off !(...), we reach this notation, which looks like an incomplete and anonymous function:

(a, b) { return freqs[a] > freqs[b]; }This is a lambda function—a short anonymous function that is usually meant to be passed to other functions. Lambda functions are so useful in so many places, D did its best at eliminating unnecessary syntactic baggage from defining a lambda: parameter types as well as the return type are deduced. This makes a lot of sense because the body of the lambda function is by definition right there for the writer, the reader, and the compiler to see, so there is no room for misunderstandings and no breakage of modularity principles.

There is one rather subtle detail to mention about the lambda function defined in this example. The lambda function accesses the freqs variable which is local to main, i.e. is not a global or a static. This is unlike C and more like Lisp, and makes for very powerful lambdas. Although traditionally such power comes at a runtime cost (by requiring indirect function calls), D guarantees no indirect calls (and consequently full opportunities for inlining) through a unique feature called local instantiation, which we'll discuss in Chapter ??.

The modified program outputs:

929 the
680 and
625 of
608 to
523 I
453 a
444 my
382 in
361 you
358 Ham.
...which is as expected, with commonly used words being the most frequent, with the exception of "Ham." That's not to indicate a strong culinary preference of Dramatis Personae, it's just the prefix of all of Hamlet's lines. So apparently he has some point to make 358 times throughout, more than anyone else. If you browse down the list, you'll see that the next speaker is the king with only 116 lines—less than a third of Hamlet's. And at 58 lines, Ophelia is downright taciturn.

1.5 Basic Data Structures
Now that we got into Hamlet, let's analyze the text a bit further. For example, for each Dramatis Persona, we'd like to collect some information, such as how many words they say in total, and how rich their vocabulary is. To do that, we need to be able to associate several data items with one persona.4 To group such information in one place, we can define a data structure as follows:

struct PersonaData {
   uint totalWordsSpoken;
   uint[string] wordCount;
}In D you get structs and then you get classes. They share many amenities but have different charters: structs are value types, whereas classes are meant for dynamic polymorphism and are accessed solely by reference. That way confusions, slicing-related bugs, and comments à la // No! Do NOT inherit! do not exist. When you design a type, you decide upfront whether it'll be a monomorphic value or a polymorphic reference. C++ famously allows defining ambiguous-gender types, but their use is rare, error-prone, and objectionable enough to warrant simply avoiding them by design.

In our case, we just need to collect some data and we have no polymorphic ambitions, so using struct is a good choice. Let's now define an associative array mapping persona names to PersonaData values:

PersonaData[string] info;and all we have to do is fill info appropriately from hamlet.txt. This needs some work because a character's paragraph may extend on several lines, so we need to do some simple processing to coalesce physical lines into paragraphs. To figure out how to do that, let's take a look at a short fragment from hamlet.txt, dumped verbatim below (with leading spaces made visible for clarity):

␣␣Pol. Marry, I will teach you! Think yourself a baby
␣␣␣␣That you have ta'en these tenders for true pay,
␣␣␣␣Which are not sterling. Tender yourself more dearly,
␣␣␣␣Or (not to crack the wind of the poor phrase,
␣␣␣␣Running it thus) you'll tender me a fool.
␣␣Oph. My lord, he hath importun'd me with love
␣␣␣␣In honourable fashion.
␣␣Pol. Ay, fashion you may call it. Go to, go to!Whether or not Polonius' enthusiasm about goto was a factor in his demise is, even to this day, a matter of speculation. Regardless of that, let's note how each character's line is preceded by exactly two spaces, followed by the (possibly contracted) character's name, followed by a period and a space, finally followed by the actual content of the line. If a logical line extends to multiple physical lines, the continuations are always preceded by exactly four spaces. We could do such simple pattern matching by using a regular expression engine (found in the std.regex module), but we want to learn arrays so let's match things "by hand." We only enlist the help of the Boolean function a.startsWith(b), defined by std.algorithm, which tells whether a starts with b.

The main driver reads input lines, concatenates them in logical paragraphs (ignoring everything that doesn't fit our pattern), passes complete paragraphs to an accumulator function, and at the end prints the desired information.

import std.algorithm, std.ctype, std.regex,
   std.range, std.stdio, std.string;

struct PersonaData {
   uint totalWordsSpoken;
   uint[string] wordCount;
}

void main() {
   // Accumulates information about dramatic personae
   PersonaData[string] info;
   // Fill info
   string currentParagraph;
   foreach (line; stdin.byLine()) {
       if (line.startsWith("   ")
             && line.length > 4
             && isalpha(line[4])) {
          // Persona is continuing a line
          currentParagraph ~= line[3 .. $];
       } else if (line.startsWith(" ")
             && line.length > 2
             && isalpha(line[2])) {
          // Persona just started speaking
          addParagraph(currentParagraph, info);
          currentParagraph = line[2 .. $].idup;
       }
   }
   // Done, now print collected information
   printResults(info);
}After we've equipped ourselves with information on how arrays work, the code should be self-explanatory, save for the presence of .idup. Why is it needed, and what if we forgot about it?

The foreach loop that reads from stdin deposits successive lines of text in the variable line. Because it would be wasteful to allocate a brand new string for each line read, the contents of line is reused every pass through the loop. As such, if you want to squirrel away the contents of a line, you better make a copy of it. Obviously currentParagraph is meant to indeed save text, so duplication is needed, hence the presence of .idup. Now, if we forgot .idup and subsequently the code would still compile and run, the results would be nonsensical and the bug rather hard to find. Having a part of a program modify data held in a different part of the program is very unpleasant to track down because it's a non-local effect (just how many .idups could one forget in a large program?). Fortunately, that's not the case because the types of line and currentParagraph reflect their respective capabilities: line has type char[], i.e., an array of characters that could be overwritten at any time; whereas currentParagraph has type string, which is an array of characters that cannot be individually modified. The two cannot refer to the same memory content because line would break the promise of currentParagraph. So the compiler refuses to compile the erroneous code and demands a copy, which you provide in the form of .idup, and everybody's happy. (The "i" in "idup" stands for "immutable.") On the other hand, when you copy string values around, there's no more need to duplicate the underlying data—they can all point to the same memory because it's known neither will overwrite it, which makes string copying at the same time safe and efficient. Better yet, strings can be shared across threads without problems because, again, there's never contention. Immutability is really cool indeed. If, on the other hand, you need to modify individual characters intensively, you may want to operate on char[], at least temporarily.

PersonaData as defined above is very simple, but in general structs can define not only data, but also other entities such as private sections, member functions, unittests, operators, constructors, and destructor. By default, each data member of a structure is initialized with its default initializer (zero for integral numbers, NaN for floating point numbers,5 and null for arrays and other indirect-access types). Let's now implement addParagraph that slices and dices a line of text and puts it into the associative array.

The line as served by main has the form "Ham. To be, or not to be-that is the question:" We need to find the first ". " to distinguish the persona's name from the actual line. To do so, we use the find function. a.find(b) returns the right-hand portion of a starting with the first occur-rence of b. (If no occurrence is found, find returns an empty string.) While we're at it, we should also do the right thing when collecting the vocabulary. First, we must convert the sentence to lowercase such that capitalized and non-capitalized words count as the same vocabulary element. That's easily taken care of with a call to tolower. Second, we must eliminate a strong source of noise: punctuation that makes for example "him." and "him" count as distinct words. To clean up the vocabulary, all we need to do is pass an additional parameter to split mentioning a regular expression that eliminates all chaff: regex("[ \t,.;:?]+"). With that argument, the split function will consider any sequence of the characters mentioned in between '[' and ']' as part of word separators. That being said, we're ready to do a lot of good stuff in just little code:

void addParagraph(string line, ref PersonaData[string] info) {
   // Figure out persona and sentence
   line = strip(line);
   auto sentence = line.find(". ");
   if (sentence.empty) {
      return;
   }
   auto persona = line[0 .. $ - sentence.length];
   sentence = tolower(strip(sentence[2 .. $]));   // Get the words spoken
   auto words = split(sentence, regex("[ \t,.;:?]+"));
   // Insert or update information
   auto data = persona in info;
   if (data) {
      // heard this persona before
      data.totalWordsSpoken += words.length;
      foreach (word; words) ++data.wordCount[word];
   } else {
      // first time this persona speaketh
      PersonaData newData;
      newData.totalWordsSpoken = words.length;
      foreach (word; words) newData.wordCount[word] = 1;
      info[persona] = newData;
   }
}The expression persona in info not only tells whether a given string is present as a key in an associative array, but also provides a handy pointer to the corresponding value in the array. In D there is no need (as it is in C and C++) to use '->' to access data referenced by a pointer—the regular field access operator '.' works unambiguously. If the key was not found, our code creates and inserts a brand new PersonaData, which concludes the addParagraph function.

Finally, let's implement printResults to print a quick summary for each persona:

void printResults(PersonaData[string] info) {
   foreach (persona, data; info) {
      writefln("%20s %6u %6u", persona, data.totalWordsSpoken,
         data.wordCount.length);
   }
}Ready for a test drive? Save and run!

     Queen 1104    500
       Ros   738    338
       For    55     45
      Fort    74     61
Gentlemen     4      3
     Other   105     75
      Guil   349    176
       Mar   423    231
      Capt    92     66
      Lord    70     49
      Both    44     24
       Oph   998    401
     Ghost   683    350
       All    20     17
    Player    16     14
      Laer 1507    606
       Pol 2626    870
    Priest    92     66
       Hor 2129    763
      King 4153   1251
Cor., Volt    11     11
Both [Mar     8      8
       Osr   379    179
      Mess   110     79
    Sailor    42     36
   Servant    11     10
Ambassador    41     34
      Fran    64     47
     Clown   665    298
      Gent   101     77
       Ham 11901   2822
       Ber   220    135
      Volt   150    112
       Rey    80     37Now that's some fun stuff. Unsurprisingly, our friend "Ham" gets the lion's share by a large margin. Voltemand's (Volt) role is rather interesting: he doesn't have many words to say, but in these few words he does his best in displaying a solid vocabulary, not to mention the Sailor, who almost doesn't repeat himself. Also compare the well-rounded Queen with Ophelia: the Queen has about 10% more words to say than Ophelia, but her vocabulary is about 25% larger.

As you can see, the output has some noise in it (such as "Both [Mar"), easy to fix by a diligent programmer and hardly affecting the important statistics. Nevertheless, fixing the last little glitches would be an instructive (and recommended) exercise.

1.6 Interfaces and Classes
Object-oriented features are important for large projects, therefore introducing them by means of small examples is at high risk of looking goofy. Add to that a pressing desire to stay away from overused examples featuring shapes, animals, or employees, and we're faced with quite a pickle. Oh, and there's one more thing—small examples usually gloss over the issue of polymorphic object creation, which is important. Talk about writer's block! Fortunately, the real world provides a useful example in the form of a problem that's relatively small, yet has no satisfactory procedural solution. The code we'll discuss below is the rewrite of a small useful awk script that had grown well beyond the implicit limits set by its design. We will work together toward an object-oriented solution that is at the same time small, complete, and elegant.

Consider writing a small statistics program called stats with a simple modus operandi: stats takes the statistical functions to compute as command-line parameters, gathers the numbers to operate on via standard input as a whitespace-separated list, and prints the statistical results one per line. Here is a sample session:

$ echo 3 5 1.3 4 10 4.5 1 5 | stats Min Max Average
1
10
4.225
$ _A quick-and-dirty script can perform such tasks no problem, yet the "dirty" tends to overshadow the "quick" as the number of statistical functions grows. So let's put together a better solution. For now, we start with the simplest statistical functions: minimum, maximum, and average. After figuring out an extensible design, the door is open to implementing more complex statistical functions.

A simple way to approach things is to just loop through the input and compute all needed statistics. This is not a scalable design because each time we need to add a new statistical function, we'd have to do surgery on existing code. The modifications will be nontrivial if we want to only do the computations asked for in the command line. Ideally, we'd confine each statistical function to one contiguous piece of code. That way, we can add new functionality to the program by simply appending new code—the Open-Closed principle [3] at its best.

Such an approach entails figuring out what all, or at least most, statistical functions have in common, to the end of manipulating them all from one place and in an uniform manner. Let's start by remarking that Min and Max take their input one number at a time, and have the result handy as soon as the input is finished. The final result is only one number. In addition, Average must do a post-processing step (divide the accumulated sum by the number of inputs). Moreover, each algorithm maintains its own state. When different computations obey a uniform interface and need to keep state, it makes sense to make them objects, and define a formal interface to manipulate any and all of them.

interface Stat {
   void accumulate(double x);
   void postprocess();
   double result();
}An interface defines a required behavior as a set of functions. Of course, anyone claiming to implement the interface must define all functions as specified by their declarations. Speaking of implementation, let's see how we can define Min to obey Stat's iron fist:

class Min : Stat {
   private double min = double.max;
   void accumulate(double x) {
      if (x < min) {
         min = x;
      }
   }
   void postprocess() {} // nothing to do
   double result() {
      return min;
   }
}Min is a class—a user-defined type that brings lots of object orientation goodies into D. Min manifestly implements Stat through the syntax class Min : Stat, and indeed defines Stat's three functions exactly with the same arguments and return types (otherwise the compiler would not have let Min get away with it). Min keeps only one private member variable min, which is the smallest value seen so far, and updates it inside accumulate. The initial value of Min is the largest possible number, such that the first input number will replace it.

Before defining more statistical functions, let's write a driver for our stats program that reads the command line parameters, creates the appropriate objects to do computations (such as Min when Min is passed in the command line) and uses the objects through the interface Stat.

import std.contracts, std.stdio;

void main(string[] args) {
   Stat[] stats;
   foreach (arg; args[1 .. $]) {
      auto newStat = cast(Stat) Object.factory("stat." ~ arg);
      enforce(newStat, "Invalid statistics function: " ~ arg);
      stats ~= newStat;
   }
   for (double x; readf(" %f ", &x) == 1; ) {
      foreach (s; stats) {
         s.accumulate(x);
      }
   }
   foreach (s; stats) {
      s.postprocess();
      writeln(s.result());
   }
}This program does quite a lot but is only one mouthful. First off, main has a signature different from what we saw so far—it takes an array of strings. The D runtime support initializes the array from the command line parameters. The first loop initializes the stats array from args. Given that in D (as in C and C++) the first argument is the name of the program itself, we skip that first argument by taking the slice args[1 .. $]. We now hit the statement

auto newStat = cast(Stat) Object.factory("stat." ~ arg);which is quite long, but, to quote a sitcom cliché, "I can explain." First, '~', when used as a binary operator, concatenates strings, so if the command-line argument was Max, the concatenation results in the string "stat.Max", which is passed to the function Object.factory. Object is the root of all class objects, and it defines the static method factory that takes a string, looks up a little database built during compilation, magically creates an object of the type named by the passed-in string, and returns it. If the class is not present, Object.factory returns null. For that call to succeed, all you need is have a class called Max defined somewhere in the same file. Creating an object given the name of its type is an important facility with many useful applications, so important in fact that some dynamic languages make it a central feature; languages with a more static approach to typing need to rely on runtime support (such as D or Java), or leave it to the programmer to devise a manual registration and discovery mechanism.

Why stat.Max and not just Max? D is quite serious about modularity so it does not have a global namespace in which anyone can put anything. Each symbol lives in a named module, and by default the name of the module is the base name of the source file (behavior that can be overridden by putting a directive module blah; at the top of the file). So given that our file is called stats.d, D reckons that every name defined in that file belongs to module stats.

There is one more hitch left. The static type of the just-obtained Min object is actually not Min. That sounds dumb, but it's justified by the fact that you could create any object by invoking Object.factory("whatever"), so the return type better be some common denominator of all possible object types—Object, that is. To get the appropriate handle on the newly-created object, we must make it into a Stat object, operation known as casting. In D, the expression cast(T) expr casts expression expr into type T. Casts involving class and interface types are always checked, so our code is foolproof.

Looking back, we notice that we've done a lot of solid work in main's first five lines. That was the hardest part, because the rest of the code writes itself. The second loop reads one number at a time (readf takes care of that) and calls accumulate for all statistical objects. The readf function returns the number of items read successfully according to the specified format (in our case, " %f " which means one floating-point number surrounded by any amount of whitespace). Finally, the program prints all results.

More Stats Implementations. Inheritance
Implementing Max is as trivial as implementing Min; aside from a slight change in accumulate, everything is exactly the same. Whenever a new task looks a lot like an old one, "interesting" and not "boring" is what should come to mind. A repetitive task is an opportunity for reuse, and rightly languages that can better exploit various flavors of similarity should rate higher on a certain quality scale. What we need to figure is the particular kind of similarity that Min and Max (and hopefully other statistical functions) enjoy. Thinking through, it looks like they both belong to the kind of statistical functions that (1) build their result incrementally, and (2) need only one number to characterize the result. Let's call this category of statistical functions, incremental functions.

abstract class IncrementalStat : Stat {
   protected double _result;
   void accumulate(double x);
   void postprocess() {}
   double result() {
      return _result;
   }
}An abstract class can be seen as a partial commitment: it implements a number of methods, but not all, and as such cannot work standalone. The way to materialize an abstract class is to inherit it and complete its implementation. IncrementalStat takes care of Stat's boilerplate code but leaves accumulate to be implemented by the derived class. Here's how the new Min looks like:

class Min : IncrementalStat {
   this() {
      _result = double.max;
   }
   void accumulate(double x) {
      if (x < _result) {
         _result = x;
      }
   }
}Class Min defined a constructor, too, in the form of a special function called this(), needed to initialize the result appropriately. Even with the constructor in place, the resulting code marks good savings from the initial state of affairs, particularly if we take into account the fact that many other statistical functions follow a similar pattern (e.g. sum, variance, average, standard deviation). Let's look at implementing average, because it's a great occasion to introduce a couple of more concepts:

class Average : IncrementalStat {
   private uint items = 0;
   this() {
      _result = 0;
   }
   void accumulate(double x) {
      _result += x;
      ++items;
   }
   override void postprocess() {
      if (items) {
         _result /= items;
      }
   }
}First off, Average introduces one more member variable, items, which is initialized with zero through the syntax "= 0" (just to showcase initialization syntax, but redundant in this case because integral types are zero-initialized anyway as discussed on page 17). Second, Average defines a constructor that sets result to zero; this is because, unlike minimum or maximum, the average of zero numbers is defined to be zero. Although it might seem that initializing result with NaN just to overwrite it later with zero is needless busywork, optimizing away the so-called "dead assignment" is low-hanging fruit for any optimizer. Finally, Average overrides postprocess even though IncrementalStat already defined it. In D, by default, you can override (inherit and redefine) member functions of all classes, with the remark that you must specify override so as to avoid various accidents (e.g. failing to override due to a typo or a change in the base type, or overriding something by mistake). In C++ lingo, member functions are virtual by default. If you prepend final to a member function, that prohibits derived classes from overriding the function, effectively stopping the virtual mechanism.

1.7 Values vs. References
Let's run a simple experiment:

import std.stdio;

struct MyStruct {
   int data;
}
class MyClass {
   int data;
}

void main() {
   // play with a MyStruct object
   MyStruct s1;
   MyStruct s2 = s1;
   ++s2.data;
   writeln(s1.data); // prints 0
   // play with a MyClass object
   MyClass c1 = new MyClass;
   MyClass c2 = c1;
   ++c2.data;
   writeln(c1.data); // prints 1
}It seems like playing with a MyStruct is quite a different game than playing with a MyClass. In both cases we create a variable that we copy into another variable, after which we modify the copy (assuming '++' is modifying). The experiment seems to reveal that after a copy, c1 and c2 refer to the same underlying storage, while on the contrary, s1 and s2 have independent lives.

The behavior of MyStruct obeys value semantics: each variable refers to exactly one value, and assigning one variable to another really means copying the state of the variable over the state of the other variable. The source of the copy is unchanged, and the two variables continue to evolve independently. The behavior of MyClass obeys reference semantics: values are created explicitly (in our case by invoking new MyClass) and assigning a class variable to another simply means that the two variables refer to the same value.

Value semantics are easy to deal with, simple to reason about, and allow efficient implementation for small sizes. On the other hand, nontrivial programs are difficult to implement without some means to refer to a value without copying it. Value semantics alone preclude, for example, forming self-referential types (lists or trees), or mutually-referential structures such as a child window knowing about its parent window. Any serious language implements some sort of reference semantics; it could be argued that it all depends on where the default is. C has value semantics exclusively and allows forming references explicitly, by means of pointers. In addition to pointers, C++ also defines reference types. Interestingly, pure functional languages are at freedom to use reference or value semantics as they find fit, because user code cannot tell the difference. This is because pure functional languages don't allow mutation, so you can't tell if they snuck a copy of a value or just a reference to it—it's frozen anyway, so you couldn't verify whether it's shared by changing it. On the contrary, pure object-oriented languages are traditionally mutation-intensive and employ reference semantics exclusively, some to the extent of allowing a disconcerting amount of flexibility such as changing system-wide constants dynamically. Finally, some languages take a hybrid approach, embracing both value and reference types, with various levels of commitment.

D makes a systematic approach to the hybrid method. To define reference types you use class. To define value types or hybrid types you use struct. As chapters ?? and ?? (respectively) describe in detail, each of these type constructors is endowed with amenities specific to this fundamental design choice. For example, structs do not have support for dynamic inheritance and polymorphism (the kind we've shown in the statistical program above), as such behaviors are not compatible with value semantics. Dynamic polymorphism of objects needs reference semantics, and any attempt to mess with that can only lead to terrible accidents. (For example, a common danger to watch for in C++ is slicing, i.e. suddenly stripping the polymorphic abilities of an object when inadvertently using it as a value. In D, slicing could never occur.)

A closing thought is that structs are arguably a more flexible design choice. By defining a struct, you can tap into any semantics that you want, be it eager-copy value, lazy copying à la copy-on-write or reference counting, or anything in between. You can even define reference semantics by using class objects or pointers inside your struct object. On the other hand, some of these stunts may require quite advanced technical savvy; in contrast, using classes offers simplicity and uniformity across the board.

1.8 Templates
Templates (code that generates code) usually fall in the advanced techniques category, and rightly so. Yet, D's approach to templates is so down-to-earth, and programming with D's templates is so exhilarating, it would be unjust to introduce D without giving a flavor of what its templates look like.

Templates started out in C++ as parameterized types (useful for things like generic containers and algorithms), but soon their use has grown well beyond that, getting into full-blown code generation. Harnessing such power is as much a black art as is mastering LISP macros—another generative technique. To wit, some later languages (such as Java or C#) have given up the code generation aspect, focusing exclusively on the easily tamed parameterized types à la "container of T." D's approach to templates is unleashing full-blown code generation, a symbolic approach to parameterization (code can be parameterized on any symbol, not only types), and thorough compile-time introspection. We will get into all the juicy details in good order in Chapter ??. For now, let's just take a quick look at some basic parameterized code. It's hard to make a good choice when it comes about such a comprehensive feature, so let's set out for no more and no less that Stepanov's three litmus tests.

Stepanov's Litmus Tests
In an interview [5], Alexander Stepanov, the creator of C++'s Standard Template Library (STL), mentions:

These are my litmus tests: if a language allows me to implement max and swap and linear search generically—then it has some potential.
To clarify, generic implementation means that you implement the operation once for all types, and that the result is as good as a hand-written implementation tuned for each pertinent type in terms of safety, convenience, and last but not least, performance. Costly abstractions are a dime a dozen.

Stepanov makes the point that these three are very simple and fundamental algorithmic primitives, and that a language with a claim to fame should implement them with ease and panache. To further clarify the requirements, the charge is to implement the following three primitives:

1.Write a generic function to compute the maximum of n ≥ 2 numbers. The function should apply to all ordered types.
2.Write a generic function to swap two values efficiently (e.g. constant memory consumption).
3.Write a generic function to linearly search for an item in any collection. The collection can be built-in or user-defined following whatever protocol the language prescribes for collections.
All implementations must be as efficient as their manually-written and inlined implementation. Let's set out to implementing the three functions in D.

--------------------------------------------------------------------------------

© 2009 Pearson Education, Inc. Informit. All rights reserved.

800 East 96th Street Indianapolis, Indiana 46240

分享到：

再学SQLite3 API | 练习：boost.timer 转D2

2009-08-04 16:23
浏览 1549
评论(0)
分类:编程语言
查看更多

发表评论

您还没有登录,请您登录后再发表评论

最近访客更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论