ATLAS: find Higgs, win $7k

Particle physics meets the machine learning sport

Amateur and experienced programmers, you have a chance to win $7,000 (gold), $4,000 (silver), or $2,000 (bronze) if you succeed in a contest organized by the LHC's ATLAS Collaboration (via Tommaso Dorigo):

Higgs Boson Machine Learning Challenge (kaggle.com)

So far there are 180+ contestants (well, teams – a team may contain at most 4 people). Anyone who registers and sends her results by September 15th, 2014 may win, however.

What is the sport about?

You have to download 55 megabytes of data in four files and write a program – assuming that you won't be able to classify all the data off the top of your head, or on the back of an envelope – that is able to classify any event (a proton-proton collision) as "s[ignal]" (super-exciting) or "b[ackground]" (boring).

You and your computer may find the right way to label the event as "s" or "b" by looking at 250,000 events which already have the right "s" and "b" labels attached to them. Yes, the contestants who don't like computers will have to cut a forest to obtain a sufficient amount of paper for that. Send your results.

I will be much more specific for you. An event looks like this:

100000,138.47,51.655,97.827,27.98,0.91,124.711,2.666,3.064,41.928,197.76,1.582,1.396,0.2,32.638,1.017,0.381,51.626,2.273,-2.414,16.824,-0.277,258.733,2,67.435,2.15,0.444,46.062,1.24,-2.475,113.497,0.00265331133733,s

You see the ID 100,000 (IDs from 100,000 up to 349,999 are training events, identified by "s" or "b" at the end; those from 350,000 up to 899,999 are contest events in the "test" file) followed by 30 reasonable-accuracy real numbers and, in the case of the training events, also 1 insanely precise number (a "weight": close to zero for "s", greater than about one for "b") as well as an "s/b" classification (a "label"). Out of the 250,000 training events, 85,667 are "s" (signal), slightly more than 1/3. (This information, as well as some other comments below, reveals results of my preliminary "research" of the files.)
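To make the layout of a training row concrete, here is a minimal Python sketch that splits the sample line quoted above into its ID, 30 features, weight, and label. The `Event` type and the `parse_training_line` helper are names I made up for illustration; they are not part of any contest toolkit.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    """One training row (hypothetical container, not a contest API)."""
    event_id: int
    features: List[float]  # the 30 kinematic numbers
    weight: float
    label: str             # "s" or "b"


def parse_training_line(line: str) -> Event:
    """Split one comma-separated training row into its four parts."""
    fields = line.strip().split(",")
    if len(fields) != 33:  # 1 ID + 30 features + weight + label
        raise ValueError(f"expected 33 fields, got {len(fields)}")
    return Event(
        event_id=int(fields[0]),
        features=[float(x) for x in fields[1:31]],
        weight=float(fields[31]),
        label=fields[32],
    )


# The sample row quoted in this post.
SAMPLE = (
    "100000,138.47,51.655,97.827,27.98,0.91,124.711,2.666,3.064,41.928,"
    "197.76,1.582,1.396,0.2,32.638,1.017,0.381,51.626,2.273,-2.414,16.824,"
    "-0.277,258.733,2,67.435,2.15,0.444,46.062,1.24,-2.475,113.497,"
    "0.00265331133733,s"
)

event = parse_training_line(SAMPLE)
print(event.event_id, len(event.features), event.weight, event.label)
```

A row from the test file lacks the last two fields, so a real loader would need a separate branch for it; this sketch only covers training rows.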

Your final submission (up to 5 submissions per day are OK) is a CSV file of this form:

EventId,RankOrder,Class
350000,416957,b
350001,89624,b
...
899999,254659,b

The rank order is a number between 1 and 550,000 you calculate – 1 is the most background-like event according to you, 550,000 is the most signal-like event. I think that only the s/b answers are actually used to pick the winners.
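As a rough sketch of how such a file could be produced, the Python snippet below turns a list of per-event scores (higher means more signal-like) into rows of the required form. The scores, the `n_signal` parameter (how many top-ranked events you decide to call "s"), and the helper names `build_submission` and `to_csv` are all my own assumptions; the contest only prescribes the three columns.

```python
import csv
import io


def build_submission(event_ids, scores, n_signal):
    """Return (EventId, RankOrder, Class) rows.

    Rank 1 goes to the lowest score (most background-like) and rank
    len(scores) to the highest (most signal-like). The n_signal
    top-ranked events are labeled "s", the rest "b".
    """
    n = len(event_ids)
    # Indices of the events, from lowest to highest score.
    order = sorted(range(n), key=lambda i: scores[i])
    rank = [0] * n
    for position, i in enumerate(order, start=1):
        rank[i] = position
    cutoff = n - n_signal
    return [
        (event_ids[i], rank[i], "s" if rank[i] > cutoff else "b")
        for i in range(n)
    ]


def to_csv(rows):
    """Serialize the rows with the header used in the example above."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["EventId", "RankOrder", "Class"])
    writer.writerows(rows)
    return buffer.getvalue()


# Four made-up scores, just to exercise the functions.
rows = build_submission(
    [350000, 350001, 350002, 350003], [0.2, 0.9, 0.1, 0.5], n_signal=1
)
print(to_csv(rows))
```

With the full test file, the same functions would be called with 550,000 IDs and scores.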

Once you train yourself or your programs to decide whether an event is an "s" or a "b", you should apply that training to the 550,000 contest events that are not known to be "s" or "b" to any contestant. The closer you get to the right classification of events as "s" or "b", the greater your chance of winning the money.

Each event is a collection of numbers – more precisely 30 parameters describing kinematical distributions – that captures some properties of the resulting products of a proton-proton collision. There are many patterns in the numbers and some combination of the patterns is useful for deciding whether the event is an "s" or a "b". You may view the events as points in a 30-dimensional space $\mathbb{R}^{30}$ and your task is essentially to develop a program drawing a map of this 30-dimensional world, i.e. dividing this space into "b" (Bundesland; sorry, I didn't find a better word) and "s" (seas). ;-) One would clearly need unrealistically huge resources to remember the "bitmap" of the 30-dimensional space. Instead, you have to design (and apply) a system to construct a compressed JPEG-like 30-dimensional image of this space.
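The crudest possible "map" of this kind is a single cut on one coordinate: everything above a threshold is "s", everything below is "b". The Python sketch below picks the threshold that classifies a labeled sample best. It is only an illustration of the idea, not a competitive method; the `best_cut` helper and the six toy values are invented for the example and have nothing to do with the real data.

```python
def best_cut(values, labels):
    """Find the threshold t maximizing the accuracy of 'value > t means s'.

    Only thresholds equal to one of the observed values are tried, and
    only the direction "larger is more signal-like" is considered.
    Returns (threshold, accuracy); the first best threshold wins ties.
    """
    best_threshold, best_accuracy = None, -1.0
    for t in sorted(set(values)):
        correct = sum(
            (v > t) == (lab == "s") for v, lab in zip(values, labels)
        )
        accuracy = correct / len(values)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy


# Toy one-dimensional sample (made up, NOT taken from the contest files).
toy_values = [0.1, 0.4, 0.35, 0.8, 0.9, 0.7]
toy_labels = ["b", "b", "b", "s", "s", "s"]

threshold, accuracy = best_cut(toy_values, toy_labels)
print(threshold, accuracy)
```

A realistic classifier would combine many such cuts over many of the 30 coordinates, which is where the "compression" of the 30-dimensional image comes from.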

If you want to know, the "s" events are those in which you create a Higgs boson $h$ that decays to a pair of heavy cousins of the electron (about 3,500 times heavier) known as the taus (a particle-antiparticle pair),\[

h\to \tau^+ \tau^-,

\] and the "b" events are those that just imperfectly imitate the "s" events. But in principle, it is said that even programmers who don't understand any particle physics should have a chance to win. It is actually an interesting question whether programmers may train their computer and/or their brain to "see" whether something is a decaying Higgs boson.

If they learn to do it, they have mastered a "practical skill" in particle physics – something that animals would be forced to learn to do in practice if their survival depended on the discrimination of decaying Higgs bosons. However, I think that the "theoretical, deep wisdom" about particle physics is much more than any practical skill of this kind, and that's also why particle physics cannot be automated.

The contestants who send their answers may see how they're doing in a "preliminary leaderboard" of results. It only incorporates 18% of the events, so it is just an approximation of the final results, but it is likely to tell you a lot about how well or badly you are doing anyway.

Incidentally, if I were picky, I would point out that in principle, one cannot sharply divide the events into "s" and "b", not only because the "b" events may parrot "s" events very closely but also because in each single real-world event, the intermediate histories with a Higgs boson and without a Higgs boson interfere with each other before you get the probability amplitude for a final state.

So the existence of the Higgs – a new particle in the intermediate histories – only modifies the probability for a particular final state, but you can never attribute the Higgs-like character of an event to the Higgs with 100% certainty (even in principle, if you know absolutely everything about the final state that can be known). It is a disclaimer similar to the usual comments that you can't attribute a particular hurricane to man-made climate change. The only difference between the Higgs boson and the CO2-hurricane link is that the existence of the Higgs boson actually does observably modify the composition of (or the character of some) proton-proton collisions.

Non-physicists are unlikely to understand the meaning of the 30 parameters listed for each event. They are:

EventId,DER_mass_MMC,DER_mass_transverse_met_lep,DER_mass_vis,DER_pt_h,DER_deltaeta_jet_jet,DER_mass_jet_jet,DER_prodeta_jet_jet,DER_deltar_tau_lep,DER_pt_tot,DER_sum_pt,DER_pt_ratio_lep_tau,DER_met_phi_centrality,DER_lep_eta_centrality,PRI_tau_pt,PRI_tau_eta,PRI_tau_phi,PRI_lep_pt,PRI_lep_eta,PRI_lep_phi,PRI_met,PRI_met_phi,PRI_met_sumet,PRI_jet_num,PRI_jet_leading_pt,PRI_jet_leading_eta,PRI_jet_leading_phi,PRI_jet_subleading_pt,PRI_jet_subleading_eta,PRI_jet_subleading_phi,PRI_jet_all_pt,Weight,Label

That is lots of information about the energies and momenta of all the leptons and all the jets (partons). PRI_jet_num – probably the number of jets – seems to be the only integer among the 30 numbers. If that number is zero, the numbers that describe the (absent) jets are written as -999.0 because the information about jets is "N/A". The numbers of training events with 0, 1, 2, 3 jets are about 100k, 78k, 50k, 22k, i.e. 250k in total, and no higher numbers are found.
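Any program reading the files has to treat -999.0 as "no value" rather than as a real measurement. Here is a small Python sketch of one way to do that, together with a tally by number of jets. The helper names are mine, and the second event is a made-up variation of the sample row, created only to exercise the code; it is not a row from the files.

```python
from collections import Counter

MISSING = -999.0    # placeholder used in the files for "N/A"
JET_NUM_INDEX = 22  # PRI_jet_num is the 23rd of the 30 features (zero-based 22)


def jet_count(features):
    """Number of jets recorded for the event."""
    return int(features[JET_NUM_INDEX])


def mask_missing(features):
    """Replace the -999.0 placeholder with None so it can't be averaged in."""
    return [None if x == MISSING else x for x in features]


# The 30 features of the sample training event quoted in this post.
sample = [
    138.47, 51.655, 97.827, 27.98, 0.91, 124.711, 2.666, 3.064, 41.928,
    197.76, 1.582, 1.396, 0.2, 32.638, 1.017, 0.381, 51.626, 2.273, -2.414,
    16.824, -0.277, 258.733, 2, 67.435, 2.15, 0.444, 46.062, 1.24, -2.475,
    113.497,
]

# HYPOTHETICAL event: the sample with the jet count zeroed and one jet
# entry blanked out. Made up for illustration only.
hypothetical = list(sample)
hypothetical[JET_NUM_INDEX] = 0.0
hypothetical[23] = MISSING  # PRI_jet_leading_pt

masked = mask_missing(hypothetical)
tally = Counter(jet_count(f) for f in (sample, hypothetical))
print(masked[23], dict(tally))
```

Run over the whole training file, the same `Counter` is how one would reproduce the 0/1/2/3-jet counts quoted above.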

But even if you know the meaning of the 30 numbers, e.g. because you are a particle physicist, there is no straightforward way to reverse-engineer them and to decide which collections of 30 real numbers are "b" and which of them are "s". Even to a trained particle physicist who knows how the Higgs may decay and who understands these labels above, none of the collections of numbers seems to say "b" or "s". Instead, all of them say "BS". ;-) So not being a physicist might not be a severe disadvantage, after all.

Only ATLAS is going to pay the money to you; CMS pays the corresponding money to Tommaso Dorigo for him to have some fun with Cicciolina.

Update: Excellent, I submitted my own random permutation with 1/3 of "s" labels, their computer was satisfied with the format, and I verified that with such unrefined submissions, one remains at the bottom of the leaderboard haha. I also have some real code discriminating the events but haven't submitted it yet. Then I posted two test submissions with a simple manual score read from the histogram. However, both of them were plagued by a severe bug in my Mathematica code: I thought that Ordering[...] returns the rank of each element, while it actually returns the inverse of that permutation. Ordering[{2,3,1}] isn't {2,3,1} but its inverse, {3,1,2}. It was fixed by replacing it with Ordering[Ordering[...]] and my rank in the leaderboard jumped by six spots or so. However, the best score improved from 0.55 to 1.3 so it is clearly sensitive to such choices. ;-) I may apply the real, potentially competitive algorithm later.
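For readers without Mathematica, the pitfall can be reproduced in a few lines of Python. The `ordering` function below is my own imitation of what Ordering does on a list of distinct numbers (the 1-based positions that would sort the list), and `ranks` applies it twice, as in the fix.

```python
def ordering(values):
    """Imitation of Mathematica's Ordering for a list of numbers:
    the 1-based positions of the elements in sorted order."""
    return [
        i + 1 for i in sorted(range(len(values)), key=lambda i: values[i])
    ]


def ranks(values):
    """Rank of each element (1 = smallest), i.e. Ordering[Ordering[...]]."""
    return ordering(ordering(values))


print(ordering([2, 3, 1]))  # positions that sort the list: [3, 1, 2]
print(ranks([2, 3, 1]))     # rank of each element:        [2, 3, 1]
```

The RankOrder column needs the second quantity: for each event, where it stands among all events, not which event stands at each position.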
