Repository for the HH->bbWW analysis
For installation instructions, visit: https://twiki.cern.ch/twiki/bin/viewauth/CMS/TTHtautauFor13TeV-Tallinn
I now have write access to the original HME repository, but I need to find some time to implement a solution that enables the verbose output only if an appropriate flag has been set.
The problem right now is that the log files become extremely large: one batch of analysis jobs generates up to 15 GB of log files. I've used the following patch to minimize the excessive logging, but it needs a better solution than that:
diff --git a/src/heavyMassEstimator.cc b/src/heavyMassEstimator.cc
index bdc306d..2276201 100755
--- a/src/heavyMassEstimator.cc
+++ b/src/heavyMassEstimator.cc
@@ -400,24 +400,24 @@ heavyMassEstimator::runheavyMassEstimator(){//should not include any gen level i
*htoWW_lorentz = *onshellW_lorentz+*offshellW_lorentz;
*h2tohh_lorentz = *htoWW_lorentz+*htoBB_lorentz;
if (h2tohh_lorentz->M()<245 or h2tohh_lorentz->M()>3800) {
- std::cerr <<" heavyMassEstimator h2 mass is too small, or too large, M_h " <<h2tohh_lorentz->M() << std::endl;
- std::cerr <<" gen nu eta "<< eta_gen <<" nu phi "<< phi_gen << std::endl;
- std::cerr <<" from heavyMassEstimator mu_onshell (px,py,pz, E)= ("<< mu_onshellW_lorentz->Px()<<", "<< mu_onshellW_lorentz->Py()<<", "<< mu_onshellW_lorentz->Pz()<<", "<< mu_onshellW_lorentz->E() <<")"<< std::endl;
- std::cerr <<" from heavyMassEstimator mu_offshell (px,py,pz, E)= ("<< mu_offshellW_lorentz->Px()<<", "<< mu_offshellW_lorentz->Py()<<", "<< mu_offshellW_lorentz->Pz()<<", "<< mu_offshellW_lorentz->E() <<")"<< std::endl;
- std::cerr <<" from heavyMassEstimator nu_onshell (px,py,pz, E)= ("<< nu_onshellW_lorentz->Px()<<", "<< nu_onshellW_lorentz->Py()<<", "<< nu_onshellW_lorentz->Pz()<<", "<< nu_onshellW_lorentz->E() <<")"<< std::endl;
- std::cerr <<" from heavyMassEstimator nu_offshell (px,py,pz, E)= ("<< nu_offshellW_lorentz->Px()<<", "<< nu_offshellW_lorentz->Py()<<", "<< nu_offshellW_lorentz->Pz()<<", "<< nu_offshellW_lorentz->E() <<")"<< std::endl;
- std::cerr <<" from heavyMassEstimator htoBB, mass "<< htoBB_lorentz->M()<<"(px,py,pz, E)= ("<<htoBB_lorentz->Px()<<", "<< htoBB_lorentz->Py() <<", "<< htoBB_lorentz->Pz() <<", "<< htoBB_lorentz->E()<<")" <<std::endl;
+// std::cerr <<" heavyMassEstimator h2 mass is too small, or too large, M_h " <<h2tohh_lorentz->M() << std::endl;
+// std::cerr <<" gen nu eta "<< eta_gen <<" nu phi "<< phi_gen << std::endl;
+// std::cerr <<" from heavyMassEstimator mu_onshell (px,py,pz, E)= ("<< mu_onshellW_lorentz->Px()<<", "<< mu_onshellW_lorentz->Py()<<", "<< mu_onshellW_lorentz->Pz()<<", "<< mu_onshellW_lorentz->E() <<")"<< std::endl;
+// std::cerr <<" from heavyMassEstimator mu_offshell (px,py,pz, E)= ("<< mu_offshellW_lorentz->Px()<<", "<< mu_offshellW_lorentz->Py()<<", "<< mu_offshellW_lorentz->Pz()<<", "<< mu_offshellW_lorentz->E() <<")"<< std::endl;
+// std::cerr <<" from heavyMassEstimator nu_onshell (px,py,pz, E)= ("<< nu_onshellW_lorentz->Px()<<", "<< nu_onshellW_lorentz->Py()<<", "<< nu_onshellW_lorentz->Pz()<<", "<< nu_onshellW_lorentz->E() <<")"<< std::endl;
+// std::cerr <<" from heavyMassEstimator nu_offshell (px,py,pz, E)= ("<< nu_offshellW_lorentz->Px()<<", "<< nu_offshellW_lorentz->Py()<<", "<< nu_offshellW_lorentz->Pz()<<", "<< nu_offshellW_lorentz->E() <<")"<< std::endl;
+// std::cerr <<" from heavyMassEstimator htoBB, mass "<< htoBB_lorentz->M()<<"(px,py,pz, E)= ("<<htoBB_lorentz->Px()<<", "<< htoBB_lorentz->Py() <<", "<< htoBB_lorentz->Pz() <<", "<< htoBB_lorentz->E()<<")" <<std::endl;
if (simulation){
- std::cerr <<"following is pure gen level infromation " << std::endl;
- std::cerr <<" nu1 px "<<nu1_lorentz_true->Px() << " py " <<nu1_lorentz_true->Py() << " pt "<< nu1_lorentz_true->Pt()
- << " eta "<<nu1_lorentz_true->Eta() << " phi "<< nu1_lorentz_true->Phi() << std::endl;
- std::cerr <<" nu2 px "<<nu2_lorentz_true->Px() << " py " <<nu2_lorentz_true->Py() << " pt "<< nu2_lorentz_true->Pt()
- << " eta "<<nu2_lorentz_true->Eta() << " phi "<< nu2_lorentz_true->Phi() << std::endl;
- std::cerr <<" onshellW mass "<< onshellW_lorentz_true->M(); onshellW_lorentz_true->Print();
- std::cerr <<"offshellW mass " <<offshellW_lorentz_true->M(); offshellW_lorentz_true->Print();
- std::cerr <<" htoWW mass "<< htoWW_lorentz_true->M(); htoWW_lorentz_true->Print();
- std::cerr <<" htoBB mass "<< htoBB_lorentz_true->M(); htoBB_lorentz_true->Print();
- std::cerr <<" h2tohh, pz " <<h2tohh_lorentz_true->Pz() << " Mass " << h2tohh_lorentz_true->M() << std::endl;
+// std::cerr <<"following is pure gen level infromation " << std::endl;
+// std::cerr <<" nu1 px "<<nu1_lorentz_true->Px() << " py " <<nu1_lorentz_true->Py() << " pt "<< nu1_lorentz_true->Pt()
+// << " eta "<<nu1_lorentz_true->Eta() << " phi "<< nu1_lorentz_true->Phi() << std::endl;
+// std::cerr <<" nu2 px "<<nu2_lorentz_true->Px() << " py " <<nu2_lorentz_true->Py() << " pt "<< nu2_lorentz_true->Pt()
+// << " eta "<<nu2_lorentz_true->Eta() << " phi "<< nu2_lorentz_true->Phi() << std::endl;
+// std::cerr <<" onshellW mass "<< onshellW_lorentz_true->M(); onshellW_lorentz_true->Print();
+// std::cerr <<"offshellW mass " <<offshellW_lorentz_true->M(); offshellW_lorentz_true->Print();
+// std::cerr <<" htoWW mass "<< htoWW_lorentz_true->M(); htoWW_lorentz_true->Print();
+// std::cerr <<" htoBB mass "<< htoBB_lorentz_true->M(); htoBB_lorentz_true->Print();
+// std::cerr <<" h2tohh, pz " <<h2tohh_lorentz_true->Pz() << " Mass " << h2tohh_lorentz_true->M() << std::endl;
}
continue;
@@ -1352,7 +1352,7 @@ heavyMassEstimator::bjetsCorrection(){
b2lorentz = *hme_b2jet_lorentz;
}
else {
- std::cout <<"wired b1jet is not jet with larger pt "<< std::endl;
+ //std::cout <<"wired b1jet is not jet with larger pt "<< std::endl;
b1lorentz = *hme_b2jet_lorentz;
b2lorentz = *hme_b1jet_lorentz;
}
@@ -1381,7 +1381,7 @@ heavyMassEstimator::bjetsCorrection(){
b1rescalefactor = rescalec1;
b2rescalefactor = rescalec2;
}else{
- std::cout <<"wired b1jet is not jet with larger pt "<< std::endl;
+ //std::cout <<"wired b1jet is not jet with larger pt "<< std::endl;
b2rescalefactor = rescalec1;
b1rescalefactor = rescalec2;
}
I'll disable them by default until we have some control plots to show.
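A minimal sketch of the direction I have in mind, instead of commenting the printouts out (the verbose_ member and setVerbosity() are assumptions, not part of the current heavyMassEstimator interface):

#include <iostream>

// Sketch only: illustrative class, not the actual heavyMassEstimator.
class heavyMassEstimatorSketch
{
public:
  // Off by default, so batch jobs stay quiet; enable explicitly when debugging.
  void setVerbosity(bool verbose) { verbose_ = verbose; }

  void checkH2Mass(double mass) const
  {
    if (mass < 245. || mass > 3800.)
    {
      if (verbose_)
        std::cerr << " heavyMassEstimator h2 mass is too small, or too large, M_h "
                  << mass << std::endl;
      // ... skip this iteration, as in the original code ...
    }
  }

private:
  bool verbose_ = false;
};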
It's very likely that we're going to relax the signal lepton definition in the DL analysis so that it is much looser than our definition of preselected leptons. In order to ensure that the SL and DL analyses don't overlap, we need to veto DL events in the SL analysis using the new lepton definition in the veto. The problem is that our Ntuples store only the leptons that pass the preselection cuts, which are presumably tighter than the proposed lepton definition. This effectively means that all Ntuples have to be post-processed again if we want to include the missing leptons that would pass the new lepton definition but currently don't because of the preselection cuts.
For historical reasons, not all NanoAOD Ntuples have the LHEPart_status branch. Initially (and officially), the NanoAOD FW saves only the status = 1 particles to the Ntuple. At some point I modified our CMSSW fork such that it also saves the status flag of the LHE particles to the Ntuple, because some of the LHE Higgses had status = 2. All Ntuples produced since this change have the branch, but the Ntuples produced before it don't. See HEP-KBFI/tth-htt#99 (comment) for more context.
This in turn brings us to the root cause: HHGenKinematicsHistManager uses these LHE particles to compute the di-Higgs mass and cos(theta*) variables. This is completely redundant, because we already have these variables pre-computed in post-production and read them at the analysis level (EventInfo::gen_mHH and EventInfo::gen_cosThetaStar).
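For illustration, a minimal sketch of the intended simplification (the struct and function names are placeholders; whether gen_mHH / gen_cosThetaStar are data members or accessors of EventInfo may differ):

#include <TH1D.h>

// Sketch only: a stand-in for EventInfo with the pre-computed generator-level variables.
struct EventInfoSketch
{
  double gen_mHH;          // pre-computed in post-production
  double gen_cosThetaStar; // pre-computed in post-production
};

// Fill the gen-level kinematics directly from the pre-computed values,
// instead of re-deriving them from LHEPart (which needs the LHEPart_status branch).
void fillGenKinematics(const EventInfoSketch & eventInfo, double evtWeight,
                       TH1D & histogram_mHH, TH1D & histogram_cosThetaStar)
{
  histogram_mHH.Fill(eventInfo.gen_mHH, evtWeight);
  histogram_cosThetaStar.Fill(eventInfo.gen_cosThetaStar, evtWeight);
}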
Recent modifications
Not directly related, but somewhat relevant, as this is how the Wjj boosted subcategories are defined.
Before it slips my mind: we need to include preselected VBF jets when computing the PU jet ID SF. It's still a bit of an open question which jet collection we should use:
I think that option 2 is the most accurate choice here, because it would be more or less on the same footing as the cuts that the central jets are required to pass when entering the SF calculation.
We need to process the samples marked with [$] (modulo ggZH) first. The only complication here is the cross sections of the new DY samples.
Somehow missed that some of the DL samples contain HH->bbZZ events.
Description in the title.
So that we don't lose track of the required changes:
This feature is already disabled in the bb1l1tau channel, but not in TT1lctrl and Wctrl. I'll fix it later today.
edit: I meant bb1l channel.
As discussed in the Mattermost channel, we lose a bit of signal yield (e.g. 14% in SL at 500 GeV) if we apply a flat SF in order to account for leptonic tau decays in the simulation. Instead, we should scale down only the W->tau nu events, because the softer leptons from tau decays are less likely to pass our analysis cuts. In other words, instead of scaling everything down, we should scale down only the portion of our signal that is less likely to contribute to our SR.
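For illustration, a minimal sketch of what this could look like at the event-weight level (the gen-particle struct, the matching logic, and the SF value are placeholders, not numbers or code from the analysis):

#include <cstdlib>
#include <vector>

// Sketch only: minimal gen-particle record with the information used below.
struct GenParticle { int pdgId; int motherPdgId; };

// Illustrative check for a W -> tau nu decay: a tau whose mother is a W boson.
bool isWToTauNuDecay(const std::vector<GenParticle> & genParticles)
{
  for (const GenParticle & p : genParticles)
    if (std::abs(p.pdgId) == 15 && std::abs(p.motherPdgId) == 24)
      return true;
  return false;
}

// Per-event weight: scale down only the W -> tau nu part of the signal,
// instead of applying a flat SF to the whole sample.
double getLeptonicTauDecayWeight(const std::vector<GenParticle> & genParticles)
{
  const double sf_tauDecay = 0.9; // placeholder value, not from the analysis
  return isWToTauNuDecay(genParticles) ? sf_tauDecay : 1.;
}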
Action items:
Uniformize the H process (TTH, TH, VH) namings/decays with respect to the ttH analysis, so that the systematics are applied correctly.
Adapt the HH names so that it is easier to deal with the BR uncertainties in the CombineHarvester step.
Plus a few extra ttbar samples in 2016. Creating the issue in order to keep track of the progress.
The plan is to switch to "fit2", described on slide 12 of this presentation, because it's more accurate at higher energy scales (which we may reach in resonant searches).
I think that the copyHistograms step makes the files that enter the hadd stage fairly small, which allows us to increase the number of inputs per hadd job significantly. At the moment, the hadd stage generates too many jobs, each of which consumes a mere 250 MB of memory, while the memory cap is at 2 GB.
Require at least one fakeable lepton in the skimming, with no limits on the multiplicity of hadronic taus. This may have the potential to shorten the time needed to produce the datacards.
The non-resonant powheg NLO HH samples and the non-resonant VBF ones (anomalous couplings included) need to be normalized to the sample cross section from the HXSWG for interpretation (not necessarily the one from McM).
I could find a table of them, but I think the safest solution is to re-do the samples dictionary so that the normalizations are updated consistently (if that is automatic).
I have a feeling that I didn't update this repository in parallel with the multilepton repository when migrating to the latest HH reweighting scheme. At least lines like these suggest it: hh-bbww/bin/analyze2_hh_bb1l.cc, lines 596 to 600 at commit 1036129.
The second issue, reported by @saswatinandan, is that the BDT Ntuples are not filled with the reweighting weights. The two problems might be connected.
Almost forgot about this bug, as it was buried in a barrage of Skype messages. Opened the issue just to keep track of things; it should be easy to implement, though.
Or, if it's not too much work, configure addSystFakeRates to yield the relative uncertainties automatically.
Creating another issue in order to keep track of the changes that are related to systematic uncertainties. Currently, on the table are:
Some of these items will be discussed in tomorrow's HH meeting.
The aim is to reduce the memory consumption by limiting the number of histograms that we book in the analysis, which is proportional to the number of sources of systematic uncertainties that we consider in an analysis job. (Creating the thread to keep track of the progress.)
Step 1)
Add the MVA for the 4-jet assignment:
Run the BDT mode and:
--> for the non-res case with the sum of BM samples reweighted!
[1] That is very much inspired by the HTT-tagger.
==========================
Step 2)
Implement the MVA in the format in which you have it available here.
As it is now I am assuming
You can, of course, change this logic if you want to test the result at other points (change the naming) and/or have a unique MVA for the whole phase space.
This result is then saved in a dictionary here, which lists the MVA variables to be saved in the evt histograms in the subcategories.
[2] I do suggest that you try one simple round of card making (see here) with the working example pointed to there, to appreciate how MVA/prepareDatacards naming conventions that make sense (e.g. keeping the MVA target, or mass-range target, in the name) help you in this step.
=====================
Step 3)
For realistic results in terms of limits, do the rebinning exercise described here.
PS: Apart from this prototype implementation, I still left the dumb test of the TF loading separate, here and here, so that testing the TF compatibility is detached from this exercise.
Since running HME is quite an expensive task, it's somewhat prohibitive to run it in the analysis job. Creating a dedicated workflow for HME analogous to the MEM one is out of the question as well, because it has a lot of overhead in terms of human time: setting up the workflow takes a substantial amount of time, and the bookkeeping becomes more complex because we would have to deal with multiple sets of Ntuples (without HME and MEM, with HME but without MEM, and with both HME and MEM).
Compared to MEM, however, HME is relatively fast. Considering that we need to compute MEM and HME in the same channel, it makes sense to move the HME computation to the same place where MEM is computed, so that both are computed in one go. The only downside concerns the shape uncertainties: if the JES or JER are varied, the estimated HH mass computed by HME may change. There are multiple options to handle this:
The MEM is implemented with the 3rd option in mind, but in practice we effectively use the 1st option and run MEM only on the central values, ignoring the effects of shape systematics in order to save some computing time. I think it's worth it to:
The task itself can be broken down into the following steps:
Move the HME computation from analyze_hh_bb2l.cc to addMEM_hh_bb2l.cc;
Analogously to MEMOutputReader_hh_bb2l and MEMOutputWriter_hh_bb2l, create classes that read and write HME masses from/to a TTree. The HME branch names should at the very least encode the systematics name (as is the case with MEM). The writer class should be used in the HME+MEM jobs and the reader class in the bb2l analysis jobs;
The testing & validation should be done using the sync Ntuple.
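For illustration, a minimal sketch of what the HME writer side could look like (the class name, branch naming, and interface below are assumptions modelled on the description above, not the actual MEMOutputWriter_hh_bb2l code):

#include <TTree.h>
#include <map>
#include <string>
#include <vector>

// Sketch only: books one HME mass branch per systematic, so that the branch name
// encodes the systematics name (mirroring how the MEM output is handled).
class HMEOutputWriterSketch
{
public:
  explicit HMEOutputWriterSketch(const std::vector<std::string> & systematics)
    : systematics_(systematics)
  {}

  void setBranches(TTree * tree)
  {
    for (const std::string & sys : systematics_)
    {
      const std::string branchName = "hme_mass_" + sys; // e.g. hme_mass_central, hme_mass_JESUp
      hmeMass_[sys] = -1.;
      tree->Branch(branchName.c_str(), &hmeMass_[sys], (branchName + "/D").c_str());
    }
  }

  // Called once per event and systematic by the HME+MEM job.
  void fill(const std::string & sys, double hmeMass) { hmeMass_[sys] = hmeMass; }

private:
  std::vector<std::string> systematics_;
  std::map<std::string, double> hmeMass_;
};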
Currently, none of the skimmed samples are usable because they're skimmed by the multiplicity of leptons that pass the "old" ttH lepton definition. We should probably consider skimming SL and DL separately, depending on its effectiveness.
As opposed to dR-based gen-matching and object cleaning, we should probably consider using the index-based approach, as we already do in the ttH analysis. This is more consistent with the current/future analyses that are migrating to NanoAOD Ntuples.
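For illustration, a minimal sketch of index-based gen-matching with NanoAOD-style branches (Muon_genPartIdx follows the standard NanoAOD convention; the surrounding structs are placeholders):

#include <vector>

// Sketch only: minimal stand-ins for the gen and reco objects.
struct GenParticle { float pt, eta, phi, mass; int pdgId; };

struct RecoMuon
{
  float pt, eta, phi;
  int genPartIdx; // filled from the NanoAOD Muon_genPartIdx branch; -1 if unmatched
};

// Resolve the gen match by index instead of looping over gen particles and matching by dR.
const GenParticle * getGenMatch(const RecoMuon & muon,
                                const std::vector<GenParticle> & genParticles)
{
  if (muon.genPartIdx < 0 || muon.genPartIdx >= static_cast<int>(genParticles.size()))
    return nullptr; // no gen match stored for this muon
  return &genParticles[muon.genPartIdx];
}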
We do not need two trainings.
By now, the half of the HH non-resonant LO events (i.e. the events that need reweighting) that is not used in the application is the odd-numbered one.
In the BDT mode it loads all of them.
That convention is followed in this commit and this commit.
One needs to pay attention
PS: since no training is done on the HH non-resonant NLO samples, they do not need to be added to this logic.
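For illustration, a minimal sketch of the event-parity split described above (the function names and the exact event-number source are assumptions; only the odd/even convention is taken from the text):

#include <cstdint>

// Sketch only: split the HH non-resonant LO events (the ones that need reweighting)
// by event-number parity; the odd half is kept out of the application.
bool useForApplication(std::uint64_t eventNumber) { return eventNumber % 2 == 0; }
bool useForTraining(std::uint64_t eventNumber)    { return eventNumber % 2 == 1; }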
The problem is that embedded Python doesn't quite work when loading tensorflow, and the problem appears to be coming from tensorflow itself. Thus, the only realistic options that could work for us are:
In the first solution, we would have to spawn the script in a separate process because we want to construct the input model only once per analysis job. Constructing the model for every selected event is prohibitive due to timing constraints.
The second option is disfavored because it requires new software (or modifications to the existing software) for managing Ntuples and jobs outside the CMSSW environment. It also adds another step between Ntuple production and analysis, and requires more human time because of the additional Ntuple management.
to bookkeeping,
--> we are doing that to compare at the datacard level with Florian/Agni
Recent modification, related to the making of datacards for the above-mentioned comparison.
Following ttH, the subcategories for datacards are implemented in such a way that they only appear in the evt folder.
After yesterday's discussion it seems that we're still missing some systematic uncertainties that other groups have implemented:
The first two points are easy enough; the third requires some investigation. The fourth item is the most challenging, because our Ntuples simply lack the information needed to compute these SFs. Will need to discuss what our options are, because I don't think we can just ignore them since they rank relatively high in terms of impact.
Even if this is not strictly necessary, I'm adding here a description of how to add more plots, for when additional plots beyond the signal-extraction ones are requested (see this issue).
As promised, it takes three lines; I will exemplify them by following the making of one plot,
and a fourth line to book the prepareDatacards making.
@saswatinandan, when you get it, close the issue
Looks like the functionality of applying PU jet ID cuts was implemented only in the central AK4 jet selector class, but not in the b-jet selector class. I think we should set up the classes such that the b-jet selector class inherits from the central jet selector class. We should've done this a long time ago, as there hasn't been any real advantage in keeping them separate.
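For illustration, a minimal sketch of the intended class layout (the class names, cut values, and PU jet ID encoding below are placeholders, not the actual selector classes in the repository):

#include <cmath>

// Sketch only: hypothetical jet record and selectors.
struct RecoJet { double pt, eta, btagScore; int puId; };

// Central AK4 jet selector: kinematic cuts plus the PU jet ID requirement.
class JetSelectorAK4
{
public:
  virtual ~JetSelectorAK4() = default;
  virtual bool operator()(const RecoJet & jet) const
  {
    return jet.pt > minPt_ && std::fabs(jet.eta) < maxAbsEta_ && (jet.puId & puIdWP_);
  }
protected:
  double minPt_ = 25.;     // placeholder threshold
  double maxAbsEta_ = 2.4; // placeholder threshold
  int puIdWP_ = 4;         // placeholder working-point bit
};

// b-jet selector: inherits all central-jet cuts (incl. PU jet ID) and adds the b-tag.
class JetSelectorBtag : public JetSelectorAK4
{
public:
  bool operator()(const RecoJet & jet) const override
  {
    return JetSelectorAK4::operator()(jet) && jet.btagScore > minBtagScore_;
  }
private:
  double minBtagScore_ = 0.3; // placeholder working point
};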
The plan is to update single-electron and single-muon trigger SFs in 2016. Changing the list of HLT paths is also on the table.
And not book so many histograms if we want to do signal extraction only.