Programming and writing about it.

echo $RANDOM

Tag: pyevolve

PiCloud + Pyevolve: Evolutionary Algorithms in the Cloud

It was quite simple to get Pyevolve running up on PiCloud.

For an Evolutionary Algorithm, there are a number of ways it can be parallelized. One of the easiest things to do is to run parallel instances of the algorithm with different initial random seeds. (This is a common practice in the research community – since starting with different random seeds allows testing of the robustness of a new algorithm or the correct  implementation of an existing one). The first exercise that I tried with Pyevolve + PiCloud is to run Pyevolve’s Genetic Algorithm implementation with 10 different initial seeds on PiCloud  – all running in parallel. Here’s how, you can go about it:

Installing Pyevolve

  1. Clone the git repository from https://github.com/perone/Pyevolve
  2. Install Pyevolve: $ sudo python setup.py install
  3. Just to check if it has been installed properly, just run one of the examples in the examples/directory

Setting up Picloud

  1. To setup PiCloud, please follow the instructions at http://www.picloud.com/ starting with registering for a free account
  2. Once you have installed the PiCloud client, please verify the installation by typing ‘import cloud’ from your Python prompt and see that it doesn’t throw up any errors

Getting Ready to Launch

For the purpose of this experiment, I shall be using one of the examples that ship with Pyevolve: pyevolve_ex7_rastrigin.py after some modifications. The modified file is here. Specifically, we change the main() function to run_ga() taking two parameters: seed and runid. The seed parameter is used to provide a random initial seed (used when creating the GA engine using ga=GSimpleGA.GSimpleGA(genome,seed)). The CSV adapater is initialized to store the statistics in a CSV file, where the runid file is used for specifying the name of the file.

Okay, now I shall describe the main driver script which I used to run this GA in the cloud. Here it is:


#!/usr/bin/python
# Simple demo to show how to run the Pyevolve
# Evolutionary Algorithms framework on PiCloud
# Pyevolve: http://sourceforge.net/projects/pyevolve/
# PiCloud: https://www.picloud.com/

# Amit Saha
# http://echorand.me

from pyevolve_rastrigin import *
import cloud

cloud.setkey(<API KEY>, 'SECRET KEY')

# List of Random seeds and run-Ids
# assuming 10 runs
seed_list=[100*(i+1) for i in range(10)]
runid_list=[i+1 for i in range(10)]

# calls the method defined in pyevolve_rastrigin.py
# which initiates the GA execution.
# Execute the code on PiCloud
jids = cloud.map(run_ga,seed_list,runid_list)

# check if the jobs are complete, if yes
# pull the stat files
cloud.join(jids)
for i in range(10):
    cloud.files.get('stats_' + str(i+1) + '.csv','stats_' + str(i+1)+'.csv')

One of the first things you need to do is set the API Key and Secret key that you will be using to access PiCloud. The API Key and the Secret key can be seen in the “API Keys” page of your PiCloud account. This is done by the line cloud.setkey(, 'SECRET KEY'). Next, we use the cloud.map() function to call the run_ga function for each pair of the values of the lists seed_list and runids_list. This is a efficient way of running the same function in the cloud, but with a different set of parameters. Once that is launched, you can see the state of your jobs from your PiCloud’s account’s Jobs page.

Next, what we do is we wait for all the jobs to finish using cloud.join(jids) and then pull all the statistic CSV files from PiCloud’s storage to the local file syste using cloud.files.get('stats_' + str(i+1) + '.csv','stats_' + str(i+1)+'.csv') (More information on using the cloud.files module is here). Once the files have been copied, you can see the results from the CSV files – each CSV file representing the result of one run of the GA. We have not yet talked about how the CSV files were created, however.

Creating files on PiCloud from Pyevolve

The source code of CSV adapater, where the CSV files are created is in pyevolve/DBAdapters.py. The open() method of class DBFileCSV is where the file is opened for writing and the close() method of the same class closes the file handle when the writing is finished. However, this method of creating files won’t work for PiCloud – rather the data won’t be retrievable for our client. We have to create the file the PiCloud way – which is to use the cloud.files.put() function. All I did here was add the line cloud.files.put(self.filename) in the close() method. And I reinstalled Pyevolve module and it all worked fine. You may find my modified DBAdapaters.py file here.

Conclusion

I hope to discuss with the Pyevolve folks what they think of my changes to the DBAdapter class and see if they suggest a better way. Like I mentioned in the beginning, this is a very naive way of harnessing the power of parallel computing in Evolutionary Algorithms. I hope to explore more in this direction with PiCloud. If you have any queries, please leave a comment.

Thank you for reading.

Advertisement

GAs with a starting point?

Local search methods in Optimization problems typically start with an initial solution. It then iteratively explores around in its neighborhood and moves to a better solution, if found. This better solution found now becomes the starting point of the next iteration.  When a better solution is not found, the iteration stops and usually we have the solution we are looking for.

Couple of things need some explanation, although brief:

  • Definition of the neighboring solution: This is a problem dependent definition. If the problem is to find a optimal solution to the TSP problem, a neighboring solution would be a neighboring node. If the problem is to find a local minima of a function having a continuous search space, then a slight pertrubation in each components of the solution can be a new neighboring solution (as done in Hooke-Jeeve’s pattern search)
  • The next better solution may  be done deterministically by examining all the solutions in a neighborhood such as in Hooke-Jeeve’s pattern search and Hill Climbing or stochastically as in Simulated annealing

Genetic Algorithms on the other hand do not have such a definite and specified starting point, but a random starting population of individuals where we may have some potentially optimal solutions which will finally give us our desired solution after the GA iterations are over.

Recently, David on the pyevolve mailing list posted a query asking if there was a way if pyevolve allowed to choose a starting population, (not individual), since he had a fair idea of the solutions he was looking for but wanted to optimize it further, if possible. Pyevolve, doesn’t yet have a API for such a use case. I blatantly, foolishly replied with my thought that it’s not possible in population based method. Not possible? WTF WTF. I mixed up what Pyevolve doesn’t allow and we actually cannot.

On further thought, why not! Say we know that our 2-variable problem has (a , b) as a possible good solution and we want to know if further improvement is possible.

Now assuming that we want a population of size 100 for our current problem, we can simply choose our initial population to be around the point (a , b). How? A very simple way to do is to consider a circle with (a , b ) as the center and of some arbitrary radius. Now we can easily find a way to find 25 (=100/4) solutions from each quadrant of the circle. That way, we have a fairly uniform representation of the neighborhood around (a , b).

For example:

Representative solutions around a starting point

The above figure shows a circle centered at (0.5,0.6) considered as a starting point of an example problem and the representative 100 individuals chosen from its neighborhood.

I tried the above approach with a couple of 2 variable functions, one of them being the Himmelblau function, often used as a test function in classical optimization literature. With different starting points, I could get a Real coded GA to converge to the correct solution.  These are just some initial results and probably the utility of such a approach is to be explored fully.

The key point here is to start with a initial population around a possibly good solution, rather than a completely random one.