Is Bloat Due to Introns?

W. B. Langdon and W. Banzhaf

Computer Science, University College, London

Gower Street, London, WC1E 6BT

Department of Computer Science, University of Dortmund

D-44221 Dortmund

1. Fitness Based on Syntax

Syntax of tree (i.e. its sequence of functions and leafs) is reduced to hash code.

This is randomised to give a fitness value.

NB fitness depends only on syntax, each tree always has same fitness, many trees share the same fitness value.

Traverse tree depth first, i.e. 208042962,34611749.
Each syntax element has an opcode.
The opcodes are packed tightly into 32 bit integers.
The first nine opcodes are packed into one integer and the rest into a second.
These are combined using XOR,
208042962 xor 3461749 = C667BD2₁₆ xor 2102225₁₆ = E7659F7₁₆ = 242637303.
242637303 is the hash value of the tree.
242637303 is randomised by Park-Miller to yield 65D2E653₁₆
65D2E653₁₆ has 17 bits set and so the fitness of the tree is 17.

1.1 Introns

If using introns, the whole of subtrees starting with ``1'' are replaced by their first terminal.

So 208042962,34611749 becomes 208042962,3467.

XOR becomes C667BD2₁₆ xor D8B₁₆ = C667659₁₆

and randomising yields 4AA4B33₁₆, which has 13 bits set.

1.2 Fitness Based on Building Blocks

Fitness given by combining ``behaviour'' of every subtree in the program.

``Behaviour'' of whole tree is union of bits set by each subtree.
Fitness is the number of bits set in the union.

Which bit is set by a subtree is given by calculating is fitness (using algorithm on page -).

Easy - Uniform

Hard - Binomial distribution

2. Building Block Fitness Distributions

Initially bigger trees have an advantage
but if size size >200 all same fitness.

Slow change in distribution w.r.t. size for big trees

3.1 Results - Random - Anti-bloat

Population quickly reaches a fitness plateau.
Followed by convergence towards a small individual with high fitness.
Stable (1000 generations) populations formed by its children.

3.2 Results - Easy Building Blocks

GP rapidly finds trees with maximum fitness.
These spread throughout the population but each generation contains some small low fitness children.
Positive correlation between size and fitness drives increase in size (Price's theorem).
Bloat continues even when most programs in the search space have maximal fitness.

3.3 Results - Hard Building Blocks

Fitness continues to improve during run but each improvement takes longer to find.
Population converges towards current highest fitness but short and low fitness children are produced in each generation.
Positive correlation between size and fitness drives increase in size.

3.4 Linear Increase in Depth

With bushy initial trees, (except for random and easy problems) average program height increases roughly linearly by about one level per generation.
The non-linear, slower depth increase in the easy search spaces is produced by reduced selection pressure.
Other types of initial population increase their depth faster than those starting with bushy trees.

3.5 Results - Evolution of Shape

Populations evolve away from the bushy trees created by ``ramped half-and-half'' towards the most popular part of the search space.

3.6 Results - Sub Quadratic Power law

Mean of 10 runs. Power law fit of mean program size v. gen.

Problem Final population Depth Power law

Fitness mean size mean per gen Exponent

Random 25 (1) 6 (0.9) 3

intron 25 (1) 530 (200) 33 0.6 (0.2) 1.23 (.09)

BBlock (easy) 32 (0) 870 (100) 33 0.4 (0.1) 1.02 (.12)

BBlock (hard) 26 (.8) 9000 (4600) 132 2.8 (1.7) 1.34 (.19)

Since programs remain close to the ridge and they increase their depth at a constant rate this leads to sub-quadratic growth in their size.

4. Conclusions

Linear GP on totally random landscape does not bloat without introns.

Tree GP on totally random landscape does not bloat without introns but is bloat due to introns?

Introduction of correlation (via building blocks or introns) to random landscape and bloat occurs regardless of presence of introns.

Regard bloat as random diffusive expansion of population into available search space. Most of search space is occupied by big random programs (cf. ridge).

GP behaves on these artificial problems like it does on real program spaces (i.e. with semantics).

C++ code may be found in ftp://cs.ucl.ac.uk/wblangdon/gp-code

About this document ...

Next: About this document ...
Bill LANGDON
2000-09-23 (14 Dec 2005)

Problem	Final population				Depth			Power law
	Fitness		mean size		mean	per gen		Exponent
Random	25	(1)	6	(0.9)	3
intron	25	(1)	530	(200)	33	0.6	(0.2)	1.23	(.09)
BBlock (easy)	32	(0)	870	(100)	33	0.4	(0.1)	1.02	(.12)
BBlock (hard)	26	(.8)	9000	(4600)	132	2.8	(1.7)	1.34	(.19)