grid - SXSW Action - Stats Discussion

SXSW Action - Stats Discussion

Matt

ID: 21503
Posts: 326

Please use this thread to discuss design and development of stat libraries for SXSW Action...

(Friday, February 03 12, 12:33 AM)

Aaron Schumacher

ID: 71842
Posts: 6

Is there a discussion here? Are there existing posts? This interface is so strange; I was expecting it to work something like a wiki, I guess... Well, here's an email I just sent, maybe folks on this discussion thing can help too:

Hi! I'm just starting to look at this, and my first observation is that the web site isn't very conducive to collaboration. For example, clicking "comment" on the "SXSW Action - Statistics" page seems to fail. I was able to comment on another page, but other posters' comments were largely unreadable and I don't know if anyone will put up with the interface.

The comment I had wanted to put up was about the available data set being just 2074 records, when it says there are tens of thousands available. It seems like the whole thing could easily be released.

A more substantial comment is that I'm not sure I understand what the goal of the project is. We're just trying to visualize the logs of an evolutionary model optimizer? Why? Is there real hope that you can make a better model by presenting a bunch of unlabeled parameters and looking for input from "human intuition"?

What would success in this project look like? What is the goal? To make it a slightly clearer question, is the goal purely to create a visualization? Is the goal to create a system through which humans can specify a set of new parameter values to try?

It says that you hope people will produce some sort of software library - what would such a library actually do? What would be the features of such a library?

Thanks so much for helping me to understand the needs of the project,

- Aaron

(Friday, February 03 12, 9:38 PM)

Matt

ID: 21503
Posts: 326

Hi Aaron.

There is a pretty "normal" (i think) wiki -- an overview of the project is here. From that page there is a link to another wiki page, wherein I was hoping maybe people would help build out a description of alternative approaches. (ie, maybe people who don't have time to code, might at least comment on general directions.)

(Friday, February 03 12, 9:48 PM)

Matt

ID: 21503
Posts: 326

More broadly --

Yes, I think your description of the project is close… You write:

"We're just trying to visualize the logs of an evolutionary model optimizer? Why? Is there real hope that you can make a better model by presenting a bunch of unlabeled parameters and looking for input from 'human intuition'?"

So: if we had a handful of parameters, 3 or 5, then your description would be spot on: we'd just want to visualize the logs; and yes, the thinking is that a person, on looking at such a presentation, could see trends; and so a person could "jump ahead" of the evolutionary algorithm, which plods along in very small steps.

By way of imagining what I mean: let's say the results could be plotted on a 2D chart… a thousand initial runs would produce a fairly random scatter… subsequent runs would begin to cluster; you'd see small circles of growing density, where the evolutionary optimizer would run subsequent generations in the vicinity of those that had worked in the past.

So the idea is that a person could see such general patterns emerging; ie the circle of results forming; and a person could then "stick a pin" in the middle of that circle, on the basis of the reasonable assumption that that's where things are going. And many times, this will save a number of generations of iterations. And since in our case model runs are computationally expensive, this would be a significant result.
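
A minimal sketch of the "stick a pin" idea, assuming each run is recorded as a tuple of parameter values plus a loss where lower is better. The function name `pin_from_best_runs`, the `keep` argument, and the sample numbers are all made up for illustration; this is not the project's actual optimizer, just one simple way to guess where a cluster of good runs is heading.

```python
from statistics import fmean

def pin_from_best_runs(runs, keep=3):
    """Return the centroid of the `keep` lowest-loss runs.

    `runs` is a list of (params, loss) pairs; params is a tuple of floats.
    """
    best = sorted(runs, key=lambda run: run[1])[:keep]
    dims = len(best[0][0])
    # Average each coordinate over the selected runs.
    return tuple(fmean(params[d] for params, _ in best) for d in range(dims))

# Made-up 2D runs: (x, y) parameter values and a loss (lower is better).
runs = [
    ((0.0, 0.0), 9.0),
    ((1.0, 2.0), 2.0),
    ((3.0, 2.0), 1.5),
    ((2.0, 5.0), 1.0),
    ((8.0, 8.0), 7.0),
]
pin = pin_from_best_runs(runs, keep=3)
```

Here the three best runs surround the point (2.0, 3.0), so that is where the pin lands. A person looking at the plot might of course place it differently, which is the point of involving them.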

(Friday, February 03 12, 10:09 PM)

Aaron Schumacher

ID: 71842
Posts: 6

Yes, I think I've seen all the pages. When you click "comment" from the project description page, it says this:

"You've followed a link to a page that doesn't exist yet. To create the page, start typing in the box below. If you are here by mistake, just click your browser's back button."

Does that just mean that there aren't any comments yet?

And I've seen the page for possible approaches. I have two main thoughts:

1) It will be easier for people (at least for me) to think about possible approaches if the overall goal of the project is more clear. What are we trying to do? What would a successful approach lead to? (See my earlier post.)

2) Can we get a complete data set?

(Friday, February 03 12, 10:10 PM)

Matt

ID: 21503
Posts: 326

But here is the second layer of complication: there are not one or two or three or five parameters in our dataset to present to the user. There are 25. And we can't think, for now, of how to present a 25-dimensional dataset to a person.

And so the thought is to see if we (and by this I mean the "you" of "we") can develop methods to reduce the number of parameters presented to the user.

Think of a machine that has 25 levers (the parameters). What we need is an "interface" of some 5 levers or so: the user works with these 5, which are linked to the underlying 25 in some ordered way; ie when the user sets the 5 levers, this input can be meaningfully passed ("translated") to the underlying machine's 25 levers…

Not sure if that's the best description; but maybe it helps?
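
One way to picture the lever analogy in code is a fixed linear map from 5 user-facing levers to the 25 underlying parameters. This is only a sketch under stated assumptions: the names `expand_levers`, `center`, and `directions` are hypothetical, and the directions used here (each lever moving a block of five parameters together) are a placeholder. In practice the directions would have to come from some analysis of the run data, which the thread leaves open.

```python
def expand_levers(levers, center, directions):
    """Translate a few user-facing levers into a full parameter vector.

    full[j] = center[j] + sum over i of levers[i] * directions[i][j]
    """
    if len(levers) != len(directions):
        raise ValueError("need one direction per lever")
    full = list(center)
    for lever, direction in zip(levers, directions):
        if len(direction) != len(center):
            raise ValueError("direction has the wrong length")
        for j, weight in enumerate(direction):
            full[j] += lever * weight
    return full

N_PARAMS = 25
N_LEVERS = 5
center = [0.0] * N_PARAMS
# Hypothetical directions: lever i moves parameters 5*i .. 5*i+4 together.
directions = [
    [1.0 if j // 5 == i else 0.0 for j in range(N_PARAMS)]
    for i in range(N_LEVERS)
]
full = expand_levers([0.5, 0.0, -1.0, 0.0, 2.0], center, directions)
```

With these placeholder directions, setting the first lever to 0.5 shifts the first five parameters by 0.5 and leaves the rest to the other levers.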

(Friday, February 03 12, 10:11 PM)

Matt

ID: 21503
Posts: 326

Returning to the original question: we want to produce a range of visualizations; and then we want to create associated tools for the user to predict where to search next.

(Friday, February 03 12, 10:11 PM)

Matt

ID: 21503
Posts: 326

(*I'll see if we can get a larger dataset; the data posted was intended as a sort of sample, to illustrate what's available.)

(Friday, February 03 12, 10:12 PM)

Aaron Schumacher

ID: 71842
Posts: 6

Cool, thanks for your responses! I'd just feel better looking at a complete data set, so that I don't feel like I could be chasing things that aren't really there.

More questions:

* The data set has parameters numbered 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 24, 25, 26, 27, 28, 29, and 30. What's going on? Is this purely a strange numbering, or did someone already decide that parameters 1 and 2, for example, were no good?

* Are the iteration numbers accurately representing ordered runs by the evolutionary algorithm? In many cases (like parameter_5) the choices seem to be actually getting more wide-spread over time, rather than narrowing in on an ideal value. Or is this just a hint that this parameter is not particularly important?

* I haven't read the linked papers - do they describe the particular model being evaluated here? It might be helpful to have some idea of how the parameters interact in the model.

(Friday, February 03 12, 10:18 PM)

Matt

ID: 21503
Posts: 326

Aaron -- seems our posts were more or less simultaneous; let me know if the above gives you better insight into what we're trying to accomplish.

But to describe in one other way:

We aim to create a set of computer games. And the idea of the games is to present information to users sufficient for them to make a good "guess" about optimal model parameters. And to your point above, the thought is that people will be able to do this better than an evolutionary algorithm alone.

(*In large part we think people will do it better because we can't afford to let the evolutionary algorithm churn through billions of iterations-- that's too expensive, even taking into account the fact that there are tens of thousands of PCs contributing to the effort.... So the objective is to bring some human intelligence into the mix, to guide the process a bit...)

(Friday, February 03 12, 10:20 PM)

Matt

ID: 21503
Posts: 326

Let me see if I can get Nick, who knows more about the specifics of the model and data, to comment here on your last post.

(And then let's remember to put any details he can provide into the data documentation here.)

(Friday, February 03 12, 10:23 PM)

Matt

ID: 21503
Posts: 326

* One last thing -- small point -- the result you got on clicking the "comment" button is in fact just an awkward way of saying that page is at present empty. You could put stuff in there if you wanted; but i wouldn't bother -- i think discussions are a better format for comments.

(Friday, February 03 12, 10:25 PM)

Nicolas Maire

ID: 27808
Posts: 1

Aaron,

Thanks for your interest and questions. I've quoted all that seems relevant from your posts below and put my answers in between.

The objective is to improve the convergence rate of the optimization process. We hypothesize that a human-aided approach would help with this. If an improved algorithm achieves the same goal (we tried various alternatives), this would also be OK.

2) Can we get a complete data set? (The comment I had wanted to put up was about the available data set being just 2074 records, when it says there are tens of thousands available. It seems like the whole thing could easily be released.)

The reason for the small number of records in the example dataset: the data comes from an attempt to parameterize a new model, started just recently, and it is just one model variant of many that we are about to fit. We can provide the current full dataset if this would be useful at this point.

Yes, these ids are ordered by sampling order. I don't have a good answer to the second question, just to say that the current algorithm does not vary the sampling range of parameter values as a function of iteration number.

* I haven't read the linked papers - do they describe the particular model being evaluated here? It might be helpful to have some idea of how the parameters interact in the model.

* The data set has parameters numbered 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 24, 25, 26, 27, 28, 29, and 30. What's going on? Is this purely a strange numbering, or did someone already decide that parameters 1 and 2, for example, were no good?

This publication (http://www.ajtmh.org/content/75/2_suppl/1.full) gives an overview of the baseline model, and Table 1 lists the parameters to be estimated. I can provide a partial mapping of the numbers in the dataset to these parameters if this would be helpful (some new parameters have been added in the model variants we're currently fitting, and others are not relevant for the new models).

Hope this helps,

Nick

(Tuesday, February 07 12, 7:16 PM)

Matt

ID: 21503
Posts: 326

(*I copied some of the QA above into the "Sample Data" page.)

(Tuesday, February 07 12, 8:24 PM)

Aaron Schumacher

ID: 71842
Posts: 6

I went ahead and made a bunch of graphs, available here:

https://plus.google.com/photos/112658546306232777448/albums/5708466103108764705

There are a lot of them, and they show what we already knew, which is that the relationships between the model parameters and the loss function are pretty complicated.

The one exception seems to be parameter 21, where it seems that all the really awful loss function values come from having that parameter set on the low end of its scale.

More patterns may come out of looking at just the better parameter combinations (not including really high loss function value rows). Perhaps I'll do that too.
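
A rough sketch of what "looking at just the better parameter combinations" could mean in code, assuming rows are dictionaries with a loss column and that "better" means at or below the median loss. The `loss` column name, the median cutoff, and the numbers are assumptions chosen for illustration, not values from the real data set.

```python
from statistics import median

def keep_better_rows(rows, loss_key="loss"):
    """Keep rows whose loss is at or below the median loss."""
    cutoff = median(row[loss_key] for row in rows)
    return [row for row in rows if row[loss_key] <= cutoff]

# Made-up rows; column names and values are placeholders.
rows = [
    {"parameter_21": 0.1, "loss": 950.0},
    {"parameter_21": 0.2, "loss": 870.0},
    {"parameter_21": 0.6, "loss": 40.0},
    {"parameter_21": 0.7, "loss": 35.0},
    {"parameter_21": 0.9, "loss": 42.0},
]
better = keep_better_rows(rows)
```

Any other cutoff (a fixed threshold, a different quantile) would slot into the same function; the median is just a convenient default.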

From a theoretical visualization standpoint, I really don't know how to show more complex interactions among three or more parameters in a single graphic, in any convenient way... I made all the two-parameter graphs, which is already too many to have to visually inspect, really. (I mean I looked at them all, but I don't know how much I got from it.) I think there may be an interesting problem here, and I'm not sure if I'm missing good existing work on it. I'd like to see some. Not sure if I'll be able to do much myself right now.
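
To put a number on "too many", here is a small sketch that counts the two-parameter and three-parameter combinations, using the parameter numbers listed earlier in this thread. It only does the arithmetic; it says nothing about how the graphs themselves were made.

```python
from itertools import combinations
from math import comb

# Parameter numbers as listed earlier in this thread.
param_ids = [3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15,
             17, 18, 19, 20, 21, 24, 25, 26, 27, 28, 29, 30]

# Every unordered pair of parameters is one candidate scatter plot.
pairs = list(combinations(param_ids, 2))
n_pairs = len(pairs)

# Three-way interactions grow much faster.
n_triples = comb(len(param_ids), 3)
```

For the 24 listed parameters this gives 276 pairs and 2024 triples, which is why inspecting every combination by eye stops being practical beyond two parameters.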

(Monday, February 13 12, 4:27 AM)

Page 1 of 2
