
MEMORANDUM 

TO:	William Charmley, Director, Assessment and Standards Division
Office of Transportation and Air Quality, U.S. Environmental Protection Agency

THROUGH:	Paul Machiele, Director, Fuels Center, Assessment and Standards Division

FROM: 	James Warila, Physical Scientist, Assessment and Standards Division

SUBJECT: 	EPA Response to Comments on the peer review of Assessing the Effect of Five Gasoline Properties on Exhaust Emissions from Light-Duty Vehicles certified to Tier-2 Standards (EPAct/V2/E-89: Phase 3)  -  Part II:  Data Analysis and Model Development

DATE:		April 5, 2013

The three peer reviewers of the EPAct Study Analysis selected by SRA International were Dr. Xuming He (University of Michigan), Dr. Christian Lindhjem (ENVIRON), and Mr. Brian West (Oak Ridge National Laboratory).  EPA extends its thanks to all three reviewers for their efforts in evaluating this document.  The reviewers brought useful and distinctive views in response to the charge questions.  

This memo includes a compilation of comments prepared by SRA International and responses from EPA to each item.  We have retained the organization reflected in SRA's compilation of the comments to aid the reader in moving from the SRA report to our responses.  All textual edits and corrections provided by the reviewers are indicated in bold type.

This memorandum contains responses to the technical comments provided by the reviewers. The reviewers also made editorial comments on topics such as word choice, spelling errors, missing words, etc.  We made a point of correcting all editorial errors listed by the reviewers, but do not summarize these corrections in this document. They may, however, be viewed in the contractor's report prepared by SRA.

3.1	Specific Technical Comments

The reviewers provided a significant number of specific technical comments and were generally favorable in their reviews of the technical and statistical aspects of the EPAct Study Analysis.  This section contains their comments, divided into those that directly address the questions or requests contained in the peer review charge and additional technical comments that the reviewers chose to provide.

3.1.1	Charge Questions & Requests for Comment

In addition to encouraging reviewers to best apply their particular area(s) of expertise to review the overall study, EPA drafted six questions or requests for comment to serve as a focus for the reviewers.  These were included in the peer review charge provided to the reviewers.  In varying degrees, the three reviewers provided direct responses to the questions or requests for comments. 

1.	Was the process of imputation of NMOG/NMHC results for tests/bags with missing speciation data reasonable and statistically sound?  (Section 2.2)
      
(He):  	Due to the speciation schedule described in Section 2.2 of the report, most tests in the dataset do not have alcohol and carbonyl measurements for bags 2 and 3. As NMOG and NMHC are calculated emission results that use speciation data, they could not be computed for the portions of the dataset without speciation. The study used imputation based on an alternate measure of hydrocarbon emissions to fill in the missing values. Linear location-scale-type models were used for imputation, with substitution of the value zero for small NMOG. The report made a convincing case that the models used for imputation fit the data well, and should result in small errors and variability due to imputation. 
      
It seems that the imputed values in the study were deterministic, given the predictor variables in Equations 8-11. If so, statistical variability could be under-reported in subsequent analyses. One approach to recommend here is to use multiple imputation to account for the variability. Based on the descriptions in the report, I do not think that the additional variability due to imputation would be a significant factor, but I would prefer to see a more explicit discussion and examination of this issue in the study. 
      
RESPONSE:     
The reviewer is correct that the models used for imputation were applied deterministically, and that the imputed values represent means for levels of the predictor NMHCFID for each ethanol level. Time and resources do not permit a repetition of the process using multiple imputation. However, we have amplified the text to clarify that the models were in fact applied deterministically. We agree that the degree of error expected in the predicted NMOG and NMHC values is small.
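To illustrate the distinction the reviewer draws, the following sketch (Python, with synthetic values; none of the numbers are study data) contrasts deterministic imputation, which fills in the conditional mean, with multiple imputation, which adds residual draws so that downstream variance estimates reflect imputation uncertainty:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical complete cases: NMOG tightly correlated with NMHC_FID,
# in the spirit of Equations 8-11 (values are illustrative, not study data).
nmhc_fid = rng.uniform(0.01, 0.5, 200)
nmog = 1.10 * nmhc_fid + rng.normal(0, 0.005, 200)

# Fit the imputation model on the complete cases.
slope, intercept = np.polyfit(nmhc_fid, nmog, 1)
resid_sd = np.std(nmog - (slope * nmhc_fid + intercept), ddof=2)

# Tests with missing speciation: only NMHC_FID is available.
nmhc_missing = rng.uniform(0.01, 0.5, 50)

# Deterministic imputation (as applied in the study): the conditional mean.
imputed_det = slope * nmhc_missing + intercept

# Multiple imputation (the reviewer's suggestion): add residual draws so
# that the across-imputation spread carries the imputation uncertainty.
m = 20
imputed_mi = np.array([
    slope * nmhc_missing + intercept + rng.normal(0, resid_sd, nmhc_missing.size)
    for _ in range(m)
])

# Across-imputation variance is the component a deterministic fill omits.
between_var = imputed_mi.var(axis=0).mean()
```

Because the residual standard deviation of the imputation models is small relative to the emission levels, the omitted between-imputation component is correspondingly small, consistent with the conclusion above.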
      
(Lindhjem):  For example, this statement on Page 36 ("The alternate measure "NMHC as measured by FID" (NMHCFID), was collected for the entire dataset, and it very tightly correlated with both NMOG and "true" NMHC. It is thus possible to estimate NMOG and NMHC results for tests without speciation by using correlations generated from those with speciation. This technique essentially estimates the offset between the response of the FID and the fully characterized emission stream, due to the incomplete measurement of oxygenates by the FID. For NMOG, this estimated value is typically between 2‐20% higher than the NMHCFID measurement, depending on emission bag and fuel ethanol level.") refers to an apparently unique measurement NMHCFID and data handling approach, and it would be useful to understand this measurement in order to understand the validity of this statement. The correlation (described in section 3.2 Imputation of Speciated Hydrocarbons (NMOG, NMHC)) of NMHCFID to NMOG and NMHC appears to indicate that fuel ethanol level has no effect on this correlation (same slope for all levels of ethanol, Tables 10, 11 & 13, 14) for Bag 2 and 3, but an ethanol slope term for Bag 1. Yet, the correlations of acetaldehyde, formaldehyde, and ethanol in Table 82 for Bag 2 demonstrate ethanol still increases these oxygenated species and so should increase NMOG (consisting usually of NMHC + oxygenated carbon). Perhaps there is something unique about this NMHCFID measurement that allows oxygenated species to be measured at some level. However, the approach to correlating NMOG and NMHC with NMHCFID appears inconsistent for Bags 2 and 3 compared to Bag 1.

RESPONSE:     
For Bags 2 and 3, our analysis suggests that the ethanol level does affect the correlation between NMOG, NMHC and NMHCFID. For these Bags, an increase in the ethanol level changes the intercept, but does not change the slope, whereas for Bag 1 increasing the ethanol level changes both the intercept and slope. These patterns are consistent for NMOG and NMHC, although the direction of the changes is positive for NMOG (i.e., increasing ethanol associated with increasing NMOG) and negative for NMHC (i.e., increasing ethanol associated with decreasing NMHC).  However, the interpretation of this apparent difference in the statistical results between Bag 1 and Bags 2 and 3 remains unclear. One possibility is that the differences in the models represent actual differences in the underlying physical or chemical processes. Another, equally likely, possibility is that the dataset allows a more detailed, refined analysis for Bag 1 than for Bags 2 and 3.  It should be easier to discern fuel effects in Bag 1 emissions, for two reasons: (1) masses measured in Bag 1 are considerably larger, meaning that measurement error is lower in relative terms for Bag 1 than for Bag 2 or Bag 3 measurements, and (2) Bag 1 is dominated by cold-start emissions, which are inherently less variable than running emissions, which constitute all of Bag 2 results and dominate Bag 3 results.  Thus, in terms of analysis, for a dataset of a given size, it should be possible to test more detailed hypotheses for Bag 1 than for Bags 2 and 3. 
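The intercept-only versus intercept-and-slope distinction can be sketched as follows (Python, synthetic data; coefficient values are illustrative stand-ins, not study estimates). Fitting a model with an ethanol-by-NMHCFID product term recovers a near-zero product coefficient when ethanol shifts only the intercept, and a clearly nonzero one when it also shifts the slope:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.uniform(0.05, 1.0, n)               # NMHC_FID (illustrative units)
e = rng.choice([0.0, 10.0, 15.0, 20.0], n)  # ethanol level, vol%

# Bag 2/3-style truth: ethanol shifts the intercept only.
y23 = 0.02 + 0.001 * e + 1.08 * x + rng.normal(0, 0.01, n)
# Bag 1-style truth: ethanol shifts both intercept and slope.
y1 = 0.02 + 0.001 * e + (1.08 + 0.004 * e) * x + rng.normal(0, 0.01, n)

# Full model columns: [1, E, x, E*x].  A near-zero E*x coefficient
# corresponds to the intercept-shift-only pattern seen for Bags 2 and 3.
X = np.column_stack([np.ones(n), e, x, e * x])
beta23, *_ = np.linalg.lstsq(X, y23, rcond=None)
beta1, *_ = np.linalg.lstsq(X, y1, rcond=None)
```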
      
(West):  Estimating NMOG from available NMHC appears reasonable and is consistent with what was done in the DOE V4 program (ORNL/TM-2011/234 and ORNL/TM-2011/461). For a given ethanol level, NMOG emissions have been shown to be a linear function of NMHC emissions.

RESPONSE:   
The patterns in the relationship between NMOG and NMHC presented in ORNL/TM-2011/461 are very similar to those presented in our report, despite the fact that the ORNL analysis focused on weighted composite results and ours treated Bags 1, 2 and 3 separately.
      
2.	Was the decision to remove very low emitting and influential vehicles from the NOx and NMOG analyses reasonable?  (Sections 5.5 and 6.1)
      
(He):  	Modifying or removing outliers and influential observations could raise questions about the validity of a study, especially when ad hoc decisions are made after the data are collected and examined. The report paid serious attention to data quality, and described how outliers and influential observations were identified. 

RESPONSE:   
We agree that it would be preferable, in concept, had we outlined an approach to influential, outlying and potentially invalid measurements prior to data collection. However, it is important to emphasize that conducting studies such as this project is extremely complex, detailed, time-consuming and expensive. Collection of measurements for the dataset used in this analysis required 18 months of sustained effort at a major research facility. Given the complexity of the instrumentation and procedures and the number of people involved over an extended time, it is improbable that no measurement issues will arise or that all issues can be identified in advance. Thus, while not ideal, it is probably unavoidable that some issues related to data quality and validity will be identified and addressed post hoc, during the analysis. We anticipate that the experience in acquiring measurements from vehicles with very low emissions in this study will reduce the number and frequency of such issues in future projects.
      
I believe that the decision to use left-censored models in the study is appropriate. This allows linear models to remain valid for the data with left end points. Such practices have been used in the statistics and econometrics literature. There are, however, several minor issues to deal with. 
      
In Section 5.2, censored measurements were replaced by the minimum positive value measured for the emission and bag. This substitution has the potential to lower the variance estimate of the statistical models, unless censoring is taken into account in the variance estimates. If the error variance estimates are deflated, we would see higher "studentized residuals", resulting in false positives in outlier detection. The issue needs to be carefully examined. 

RESPONSE:  

We agree with the reviewer regarding the potential for bias from the substitution of censored measurements in the identification of influential measurements using the mixed model. This issue is of substantial importance only for the PM models, as outlying measurements were removed only for these models. 

The datasets for NMOG and NMHC had high numbers of censored measurements (in Bags 2 and 3) but no influential measurements flagged and no measurements removed.  Thus, the potential for variance deflation, resulting in the spurious identification of measurements as influential, would not change the outcome of model fitting for these compounds.
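The variance-deflation mechanism the reviewer describes can be demonstrated on synthetic data (illustrative only, not study data): substituting a fixed minimum positive value for censored observations compresses the lower tail, shrinking the variance estimate and thereby inflating studentized residuals.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative right-skewed emissions with a detection limit.
true = rng.lognormal(mean=0.0, sigma=1.0, size=1000)
dl = np.quantile(true, 0.30)          # ~30% of values fall below the limit

# Substitution: censored values replaced by the minimum positive
# measured value, as described in Section 5.2.
observed = true.copy()
min_positive = true[true >= dl].min()
observed[true < dl] = min_positive

# Substitution compresses the lower tail, deflating the variance
# estimate relative to the uncensored data.
var_true = true.var(ddof=1)
var_sub = observed.var(ddof=1)
deflation = var_sub / var_true        # < 1 when deflation occurs
```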

The dataset for NOx had a large number of censored measurements in Bag 3 only. However, no influential measurements were identified in Bag 3.  Influential measurements were identified in Bags 1 and 2, but no measurements were removed. Thus, for NOx, errors in identifying influential measurements would not affect the outcomes of model fitting.

In the case of particulate, the situation needs closer evaluation in that three measurements were removed. One measurement was removed in each of Bags 1, 2 and 3.  The measurements in Bags 1 and 2 had studentized-deleted residuals of 4.45 and 4.27, respectively (Appendix G). These values exceed the threshold (3.5) by a margin large enough that accounting for potential bias due to neglecting the censored measurements in this analysis is unlikely to bring their studentized-deleted residuals below the threshold. More importantly, aside from the statistical analysis, a straightforward view of these points in the context of all measurements on the same vehicles makes it clear why they were identified as influential and selected for removal. 

In Bag 1, the run selected for removal (run 6247) has a value of 413 mg/mi, whereas all remaining measurements on the same vehicle (Liberty) have values less than 35 mg/mi. Further, the replicate measurement (run 6259) on the same fuel (16) has a value of 20.0 mg/mi, which is lower than that for run 6247 by a factor of 20.7.  See Figure 1 below.

Similarly, in Bag 2, the run selected for removal (Run 5284) has a value of ~110 mg/mi, whereas all remaining measurements on the same vehicle (Explorer) have values of 0.50 mg/mi or less. The selected run exceeds all other measurements on all other fuels by a factor of 220. See Figure 2.

Finally, in Bag 3, the run selected for removal (Run 6281) has a value of ~62 mg/mi, whereas the remaining measurements on the same vehicle (Explorer) have values of 5 mg/mi or less. See Figure 3. Despite its size, this point was not flagged as influential.

In all three cases, the points identified are substantially higher than other measurements on the same vehicle. In addition, the selected measurements were also higher than their respective replicates on the same fuels, by similar margins.  The selected measurements differed from their counterparts on the same vehicles by margins wide enough that, while we were unable to diagnose specific physical causes, we think it is safe to conclude that anomalies or errors occurred in the process of sampling or measurement. Thus, these three measurements were dropped by agreement among the study participants.
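For reference, a minimal sketch of the flagging calculation (externally studentized, i.e., studentized-deleted, residuals compared against the 3.5 threshold used in the report), applied here to synthetic data rather than study data:

```python
import numpy as np

def studentized_deleted_residuals(X, y):
    """Externally studentized (deleted) residuals for a linear fit."""
    n, p = X.shape
    H = X @ np.linalg.pinv(X.T @ X) @ X.T   # hat matrix
    h = np.diag(H)
    resid = y - H @ y
    s2 = resid @ resid / (n - p)
    # Leave-one-out variance estimate, then studentize.
    s2_i = (s2 * (n - p) - resid**2 / (1 - h)) / (n - p - 1)
    return resid / np.sqrt(s2_i * (1 - h))

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(100), rng.uniform(0, 1, 100)])
y = 1.0 + 2.0 * X[:, 1] + rng.normal(0, 0.1, 100)
y[10] += 2.0                            # one gross anomaly, like run 6247

t = studentized_deleted_residuals(X, y)
flagged = np.where(np.abs(t) > 3.5)[0]  # threshold used in the report
```

A gross anomaly of the magnitude seen in runs 6247 and 5284 produces a studentized-deleted residual far above the threshold, so the modest variance deflation discussed above could not plausibly reverse the flagging decision.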

Figure 1.  PM (Bag 1):  Measurements for the Jeep Liberty, by fuel. The measurement identified as influential and selected for removal (Run 6247, on fuel 16) has an exceptionally high value (413 mg/mi).


Figure 2. PM (Bag 2): Measurements for the Ford Explorer, by fuel. The measurement identified as influential and selected for removal (Run 5284, on fuel 27) has an exceptionally high value (~110 mg/mi).



Figure 3.  PM (Bag 3): Measurements for the Ford Explorer, by fuel. The measurement identified as influential and selected for removal (Run 6281, on fuel 10) has an exceptionally high value (~62 mg/mi).

      
(b) At the end of Section 5.2, it was mentioned that Run 6281 in Bag 3 was removed even though it was not flagged as influential. The specific reason for this decision was lacking in the report. 
      
RESPONSE: 
See the response to the previous comment.
      
(c) In the analysis of Section 5.3.1, a distinction was made between light censoring and severe censoring. In the case of severe censoring, the Tobit regression was used in the data analysis. Otherwise, the censored values were substituted. I do not see good reasons for handling the two scenarios differently. Why not use the Tobit regression in all cases? The current practice would raise a question about the stability and sensitivity of the results if one more or one fewer data point is censored. 

RESPONSE:  
The distinction between "light" and "severe" censoring was adopted through consensus among the study participants (EPA, DOE and CRC). "Light" censoring was defined to include cases where only a few measurements were censored. In these cases, we assumed that substitution would not result in serious bias. With light censoring, we applied substitution to retain the advantages of the mixed-model approach, namely, the ability to treat vehicles as random factors in estimating standard errors for the fixed-effect coefficients, and the availability of various diagnostics in the MIXED procedure that are not available in the LIFEREG procedure. 
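As a sketch of the Tobit (left-censored) likelihood used under severe censoring, the following fits a censored regression by maximum likelihood on synthetic data. SAS PROC LIFEREG performs the equivalent fit in the actual analysis; this Python version is illustrative only, and all parameter values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(4)
n = 500
x = rng.uniform(0, 1, n)
y_latent = 0.5 + 1.5 * x + rng.normal(0, 0.5, n)
c = 1.0                                  # detection limit (censoring point)
y = np.maximum(y_latent, c)              # left-censored observations
censored = y_latent < c                  # substantial censoring fraction

def neg_loglik(theta):
    b0, b1, log_s = theta
    s = np.exp(log_s)
    mu = b0 + b1 * x
    # Observed values contribute the density; censored values contribute
    # the probability mass below the censoring point.
    ll_obs = norm.logpdf(y[~censored], mu[~censored], s)
    ll_cen = norm.logcdf((c - mu[censored]) / s)
    return -(ll_obs.sum() + ll_cen.sum())

res = minimize(neg_loglik, x0=[0.0, 0.0, 0.0], method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-8})
b0_hat, b1_hat, s_hat = res.x[0], res.x[1], np.exp(res.x[2])
```

Unlike substitution, the censored-likelihood fit recovers both the regression coefficients and the error variance without downward bias, which is why it was preferred when censoring was severe.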
      
(d) In Section 5.3.2, both BIC (model selection criterion) and likelihood ratio tests were described and used. Although both approaches are valid and useful, they have different goals in mind. Model selection based on BIC is to choose models, treating all competing models equally. The chi-square tests have a null hypothesis in mind, where the null hypothesis refers to the smaller models in the present analysis. Test decisions are designed to protect the null hypothesis, so the competing models are not treated equally. When both approaches are used, one needs to be clear how they work together, and what is to be achieved. I do not imply that anything has gone wrong here, but this part of the analysis needs to be made clearer as to why one cannot simply use BIC.

RESPONSE:  

In the final report, we have performed additional model fitting, revising the procedure so as to integrate the use of BIC as a ranking criterion with the use of goodness-of-fit tests in the selection of reduced models. In the revised approach, all possible models were fit and ranked by BIC to identify a set of leading candidates.  Then, because the BIC is one option among several available criteria (e.g., AIC, Cp, adjusted R2), because small to very small differences in the criterion can occur between candidate models, and because no approach is available to determine whether small differences in BIC are "meaningful" or "statistically significant," we performed a final round of fitting to select a final reduced model. In this step, all terms in the set of leading candidates were pooled to form a "superset" of terms.  The "superset" was then treated as a "full model" in a model-fitting procedure using likelihood ratio tests to identify the subset of terms contributing to fit. This subset was retained as the final reduced model. This approach combines the use of a screening criterion, to avoid overlooking good candidate sets, with the use of statistical tests, to compensate for reliance on any single criterion in making a final selection.
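The revised procedure can be outlined schematically as follows (Python, synthetic data; the term names and effect sizes are hypothetical stand-ins, and the actual analysis uses mixed models rather than ordinary least squares):

```python
import numpy as np
from itertools import combinations
from scipy.stats import chi2

rng = np.random.default_rng(5)
n = 400
# Candidate terms (illustrative stand-ins for standardized fuel properties).
terms = {name: rng.normal(0, 1, n) for name in ["etoh", "arom", "rvp", "t50", "t90"]}
y = 1.0 + 0.8 * terms["etoh"] + 0.5 * terms["arom"] + rng.normal(0, 1, n)

def rss(names):
    X = np.column_stack([np.ones(n)] + [terms[t] for t in names])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return ((y - X @ beta) ** 2).sum()

def bic(names):
    k = len(names) + 2                   # slopes + intercept + sigma
    return n * np.log(rss(names) / n) + k * np.log(n)

# Step 1: fit all subsets and rank by BIC.
all_names = list(terms)
subsets = [list(c) for r in range(len(all_names) + 1)
           for c in combinations(all_names, r)]
ranked = sorted(subsets, key=bic)

# Step 2: pool the terms of the leading candidates into a "superset".
superset = sorted({t for s in ranked[:3] for t in s})

# Step 3: backward elimination from the superset via likelihood ratio tests.
model = list(superset)
while model:
    # LR statistic for dropping each term: n * log(RSS_reduced / RSS_full).
    stats = {t: n * np.log(rss([u for u in model if u != t]) / rss(model))
             for t in model}
    weakest = min(stats, key=stats.get)
    if chi2.sf(stats[weakest], df=1) > 0.05:
        model.remove(weakest)            # not significant: drop it
    else:
        break
final = sorted(model)
```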
 
      
(West):  These discussions were largely reasonable and convincing; however one case (run 6281) was not explained.

RESPONSE:   
This comment is addressed above.
      
3.	Please comment on the use of the "design set" of 11 terms as the basis of the final models, versus allowing models to fit all 17 terms, including the adequacy of the justification of this decision.  (Section 7)
      
(He):  The considerations and justifications on finding final models (Section 7.2) were reasonable and well thought-out. Because some subjectivity was involved in the final model selections, it is hard for me to tell whether each detailed review and scrutiny reported in Section 7.3 is "optimal". On the other hand, there is no single model that is likely to be the best. In reality, it is generally the case that several models are (almost) equally good given the limited amount of data, and the final selection can be made with some subjectivity. The analyses given in Section 7.3 appear reasonable.

RESPONSE: 
We agree that it is not possible to identify a "best" model in an absolute sense, as the selection is always contingent on the procedures and criteria used and the assumptions adopted in applying them.  Likewise, the discussion in 7.3 is not intended as "optimal" but simply as additional scrutiny to lay the groundwork for interpretation of the statistical associations described by the models.
      
(Lindhjem):  The approach to modeling is well considered given the low emissions rates of these vehicles and the relatively small fuel effects, often below normal detection limits. It was apparent that the iterative process, using not only novel statistical techniques (compared with other fuel effects evaluations) but also an understanding of the testing limitations (low emission rates coupled with detection limits), was necessary to determine the relevant fuel parameters to include in the evaluation.
      
I agree with the approach of limiting the number of statistically fit terms (especially the second order terms) to only those that assist in explaining the fuel effects. Perhaps the discussion of the magnitude of the residuals could further highlight the lack of impact of terms that have been dropped from the correlations such as in Figure 70, where a trend may exist, but is relatively small in magnitude.

RESPONSE:  
The approach illustrated in Figure 70 (Figure 85 in the final report) is used to explore the possible existence of interactions. As explained in the text, the residuals from a model including only linear terms are averaged and plotted.  In the plots, the existence of a trend across the mean = 0.0 line suggests the possibility of an interaction. For the four potential interactions illustrated in Figure 70, the results for the reduced models in Table 60 (FM10) (Tables 35 and 39 in the final report) show the first three interactions as significant (subplots a-e). The ethanol x T50 interaction, shown in subplot (d), is, however, not retained in the model, implying that it does not improve fit. A term that does not improve fit is generally insignificant as well. 
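The diagnostic can be sketched as follows (Python, synthetic data; variable names and effect sizes are hypothetical): residuals from a linear-terms-only fit are averaged within bins of a candidate interaction product, and a trend across the mean = 0 line points to the omitted interaction.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 600
z_e = rng.normal(0, 1, n)                 # standardized ethanol (illustrative)
z_a = rng.normal(0, 1, n)                 # standardized aromatics (illustrative)
y = 0.6 * z_e + 0.4 * z_a + 0.3 * z_e * z_a + rng.normal(0, 0.5, n)

# Fit the linear-terms-only model.
X = np.column_stack([np.ones(n), z_e, z_a])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Average residuals within quintile bins of the candidate interaction
# product Z_e * Z_a, as in the Figure 70 plots.
prod = z_e * z_a
bins = np.quantile(prod, np.linspace(0, 1, 6))
means = [resid[(prod >= lo) & (prod <= hi)].mean()
         for lo, hi in zip(bins[:-1], bins[1:])]

# When the interaction is real, the binned means trend upward through zero.
trend = np.corrcoef(prod, resid)[0, 1]
```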
      
The overall statistical approach is sound, using several methods to discover and systematically eliminate terms. The approach to identify influential data used appropriate physical (investigate detection limits and other laboratory variables) and statistical (significance level, BIC, and residual evaluation) methods to eliminate or keep data.
      
(West):  	This approach seems reasonable based on the discussion.



      
4.	Comment on the use of the Tobit regression (SAS PROC LIFEREG) for modeling datasets with large numbers of censored values.  (Section 5.3)
      
(He):  	I think that the use of left-censored models and the Tobit regression is appropriate. My only question here is why the same approach is not taken for cases with light censoring. See my comments under Item #2. 

RESPONSE:
Refer to response to Item 2 above.
      
(West):  This approach seems reasonable, although I would like to see "large" and "small" censoring levels explained further. Censoring of less than 5 values is considered small. Why? With over 900 datapoints, could the limit be set at a higher number? Please explain.

RESPONSE:
Refer to response to Item 2 above.
      
5.	Comment on the decision to independently model the linear terms and interactions between fuel blends as presented in the final results?
      
(He):  	Table 36 reports correlation coefficients between the linear effects and the additional terms (interactions). I would suggest including the canonical correlation between the set of linear effects and the set of interactions. This would assess linearity beyond pairwise correlations. 

RESPONSE.   
We are satisfied that the pairwise correlations indicate that the two-stage standardization is sufficient to neutralize important correlations among the linear and 2nd-order effects, and among the 2nd-order effects themselves. Thus, while the results of a canonical correlation analysis could be instructive, we think it unlikely that the additional analysis would have a substantial effect on the modeling approach adopted.
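For completeness, the canonical correlation analysis the reviewer suggests is straightforward to compute; the following sketch (synthetic data, illustrative dimensions) obtains the canonical correlations between two column sets via QR and SVD:

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the column sets X and Y via QR + SVD."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx'Qy are the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0, 1)

rng = np.random.default_rng(7)
n = 500
# Illustrative stand-ins: "linear effects" and "2nd-order terms" after a
# standardization intended to decorrelate them (fully decorrelated case).
lin = rng.normal(0, 1, (n, 3))
quad = rng.normal(0, 1, (n, 3))
rho = canonical_correlations(lin, quad)        # all near zero

# A deliberately correlated second set, for contrast.
quad_corr = 0.7 * lin + 0.7 * rng.normal(0, 1, (n, 3))
rho_corr = canonical_correlations(lin, quad_corr)  # leading value is large
```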

      
(Lindhjem):  
      
Statement (p. 9) "The analysis involved ongoing and iterative interaction between statistical modeling and additional physical and chemical review of the data." Page 23 "The final design is the result of an iterative process involving interactions between research goals, the feasibility of fuel blending, and experimental design."
      
Comment: The term "interaction" should refer only to statistically fitted second order terms where fuel parameters are mixed, such as ZZea , to represent the second order ethanol x aromatics term. The statement above appears to use "interaction" in a different context and so is confusing. EPA should also search all other uses of "interaction" to ensure that there is no confusion.


RESPONSE:   
The comment is well taken. We have revised the text accordingly to ensure that the term "interaction" is used only in the specific technical sense mentioned by the reviewer. 

Statement (p. 13) "Note that this generalization does not account for the effect of interactions between RVP and other properties, which are in some cases larger than the underlying linear effects."
      
Comment: I see only two RVP interacting terms affecting only running THC (not NMOG or NMHC) and start CO, and this statement only appears to be true for the CO start emissions. I would suggest either stating this more plainly or striking the comment. Or does this line refer to fuel properties that are affected when RVP is modified? For example, reduced T50 or diluted aromatics occur when lighter compounds are added to increase RVP; then this statement would be true for the other fuel properties than just for RVP.
      
RESPONSE:   
Considering the ambiguity in the passage cited by the reviewer, we considered it more helpful to readers of the final report to strike the passage rather than to revise it.
      
(West):	  This approach seems reasonable based on the discussion.
      
6.	Please comment on the methods used to select reduced models.  (Section 5) 

(He):  The considerations and justifications on finding final models (Section 7.2) were reasonable and well thought-out. Because some subjectivity was involved in the final model selections, it is hard for me to tell whether each detailed review and scrutiny reported in Section 7.3 is "optimal". On the other hand, there is no single model that is likely to be the best. In reality, it is generally the case that several models are (almost) equally good given the limited amount of data, and the final selection can be made with some subjectivity. The analyses given in Section 7.3 appear reasonable.

RESPONSE:
We have responded to this comment above, under item 3.
      
(West):	  This discussion was convincing, although I am not an expert in this area. I understand that reduced models have lower likelihood of the models describing the random error rather than the underlying fuel effects.
      
3.1.2	Other Specific Technical Comments
               
All of the reviewers provided specific technical comments in addition to their responses to the specific questions in the peer review charge.
               
(He): I have some additional suggestions, some of which might be useful for future studies. First, if a similar study is planned in the future, it would be better to construct specific criteria for removing outliers prior to data collection. This would eliminate questions about biased inference in the data processing stage. Second, when imputation is used, more careful procedures should be in place to account for variability due to imputation. Multiple imputation is a common approach to take. Third, some sensitivity analysis using robust statistical methods can be performed to understand the effects of outlying/influential points and their handling on the final analysis. Most statistical techniques, including the linear mixed models and the Tobit regression used in this study, are based on the assumption of Gaussian errors. Robust statistical methods can help us understand the impact of non-Gaussian errors.

RESPONSE:
We appreciate the reviewer's suggestions and agree that they would be valuable when applied in future efforts.  

This analysis did assume Gaussian errors, which we consider a reasonable assumption based on extensive experience with emissions data.  It is generally known that vehicle emissions tend to follow approximately lognormal distributions, although there is discussion as to whether Weibull distributions can provide adequate fits in some cases. For this reason, the natural log transformation is commonly used to model emissions data, and is considered to give distributions for residuals that are approximately normal.  For this study, a parallel analysis performed an assessment of the normality of residuals and concluded that the natural log transform was adequate both to normalize residuals and to stabilize the variances of the residuals across the ranges of the fuel properties.
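The effect of the natural log transform on approximately lognormal emissions can be illustrated with synthetic data (parameter values are arbitrary): the raw values are strongly right-skewed, while their logs are close to symmetric, consistent with the normality assessment described above.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(8)
# Illustrative emissions: approximately lognormal, as is typical of
# vehicle emissions data (parameters are arbitrary, not study values).
em = rng.lognormal(mean=-2.0, sigma=1.0, size=2000)

skew_raw = skew(em)            # strongly right-skewed in natural units
skew_log = skew(np.log(em))    # near zero after the natural log transform
```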
               
(Lindhjem):  
      
  The fuel properties in the testing matrix used in the evaluation are quite different from previous studies in two respects. The higher levels of ethanol, up to 20%, are beyond what any previous study has considered. Also, the lack of olefins data eliminated one fuel parameter that was found to have a significant effect on emissions in previous studies.

RESPONSE:   
The inclusion of fuels with >10% ethanol was funded by DOE (a co-sponsor of this program) as providing potentially forward-looking data given the increasing interest in mid-level ethanol blends.  Regarding olefins, while EPA recognizes that this fuel property could potentially affect emissions from Tier 2 vehicles, at the time the study was being designed it was deemed to be of less importance than the five target properties selected.  A more detailed discussion of how the fuel parameters and their ranges were selected is available in the report describing the conduct of the study.  

      
One issue to be determined is how EPA or others will choose to extend the fuel property relationships developed in this report to the general fleet, such as in MOVES. Because vehicles will naturally age and may respond differently to fuel properties, these relationships may not continue to hold true. These higher emitting vehicles could potentially contribute to the emissions inventories out of proportion to their numbers. Care should be taken when extending the fuel effects to other vehicles, whether aged late model light‐duty or heavy‐duty gasoline powered vehicles.

      
RESPONSE:   

We acknowledge the importance of the questions raised by the comment. However, we note that this comment goes beyond the scope of the report under review, which concerns the design, execution and analysis of the Phase 3 study. 
Going beyond this analysis, the comment concerns the application of the study results in a context such as the MOVES model.  We conclude that the comment is more appropriately posed and addressed in a venue providing for review of the application of the study results, as opposed to their generation as described in this report. 

      
[3]  Statement (p. 181) " The typical hydrocarbon analyzer used for emission testing uses a flame ionization detector (FID), which is calibrated to accurately count carbon atoms that are bonded to hydrogen. Carbons bonded to oxygen, which occur in carbonyl and alcohol emissions from burning ethanol fuels, are not accurately counted by the FID, and thus emissions from ethanol fuels require additional characterization methods to properly quantify as NMOG or VOC."
      
Comment: This statement is generally, but not strictly, accurate in that most FID units are calibrated on propane, and the hydrocarbon measurement assumes that all carbons in the sample respond the same as the carbon atoms in propane. Hydrocarbons do not necessarily respond identically to propane carbon atoms, but carbons bound to oxygen respond at a rate order(s) of magnitude lower. The error for most hydrocarbons has been considered insignificant. The response to oxygenate carbons has historically been considered insignificant compared with other hydrocarbons in the sample, so carbonyl and alcohol compounds determined through alternative methods are added to the NMHC weight measure as the weight of the single-carbon aldehyde (formaldehyde) and alcohol (methanol). The different composition of hydrocarbons and any FID response to oxygen-bound carbon influence the NMHC measurement, but those influences are considered minor. With the lower emission rates of these newer vehicles, these assumptions may not be as appropriate as for older vehicle designs with higher emission rates. In addition, does the `typical hydrocarbon analyzer' differ from the NMHCFID measurement that needs to be correlated with NMHC before the statistical modeling proceeds?
      
RESPONSE:   
The statement in the report as reviewed seems like a less detailed summary of the commenter's explanation, but not necessarily inconsistent with it.  Some minor clarifying edits have been made.  Regarding nomenclature, the program results include four measurements of what should perhaps most properly be called "organic gases": THC (the raw FID analyzer output), NMHCFID (THC minus methane), NMHC (NMHCFID minus the partially-measured oxygenated species), and NMOG (NMHC plus the full masses of oxygenated species as measured by speciation methods).  We are aware that in the past NMHC has often been used to represent the same set of species as what is here defined as NMHCFID (and that other computed values, like "NMHC equivalent" (NMHCE), have been used in different contexts).  The primary driver for the interest in computing and modeling NMOG is the presence of substantial amounts of oxygenated emission species due to ethanol blends, which are not fully captured by simply reporting NMHCFID (THC minus methane).
      
[4] "8.2.2.4 Vapor Pressure" (p. 182)
      
Comment I found this section description (and others that rationalize the effect modeled) more of an unsatisfying hand waving exercise to justify what the data was telling. If indeed the older studies and this most recent evaluation are to be believed, then the modeled effect may be temperature dependent when at high temperature the feedback of vapors to the engine may affect the emissions opposite to that modeled here. The suggestion is to limit the speculation to the results of the analysis presented and note other references if the effect found needs to be justified.
      
RESPONSE:   
We do not see the discussion in this paragraph as an attempt to "justify" the results of the current study, but rather to interpret them and to relate them to previous findings, a routine practice in the discussion sections of peer-reviewed papers. Given the complexity of the physical and chemical processes involved in fuel combustion, interpretation is inherently uncertain and generally involves some conjecture. Nonetheless, considering the speculative nature of the discussion, we have elected to remove this paragraph from the final report.   

      
(West):  
      
[1] This reviewer has over 20 years experience in engine, emissions, and vehicle testing, but limited experience with many of the statistical methods and modeling approaches described. To the extent possible in the short time available, methods and terms were researched and explored while reviewing the EPA document. Explanations and approaches appear largely reasonable. Some specific questions are noted in comments in the report. For example, the Bayesian Information Criterion (BIC) is used to determine goodness of fit. In one example, the BIC for the full model starts out at >2900 and decreases by 0.1-0.2% for subsequent reduced models. The discussion refers to "a steady decrease" in BIC. The reviewer suggests the significance of such a small change in BIC be discussed further. While the reduction in BIC appears insignificant, the fact that BIC is not increasing is perhaps the justification to use the reduced model (sec. 5.3.2).
      
RESPONSE:   
This question arose repeatedly during the analysis.  We did pose the question of whether small differences in BIC represented "real" or "significant" differences between models. Unfortunately, no statistical test is available that can be applied to values of the BIC. This limitation on the applicability of the BIC is one reason why it was supplemented with the use of likelihood-ratio tests in model fitting. 

The BIC can be described as the sum of the -2 log likelihood and a second term representing the "information content" of the model. A model with more terms will give a reduction in BIC only if the associated increase in "information content" causes a correspondingly larger reduction in the -2 log likelihood (indicating an improvement in fit). Thus, the addition of terms is worthwhile if the improvement in fit exceeds the additional "information cost." Conversely, the addition of terms is not worthwhile if the "information cost" exceeds the improvement in fit. This approach has an advantage in that it allows multiple models to be compared on an equal basis, even if the terms included in different models vary.
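The trade-off just described can be illustrated with a minimal sketch under a Gaussian error assumption; the data and the superfluous quadratic term below are hypothetical, not the report's actual models:

```python
import numpy as np

def gaussian_bic(y, y_hat, k):
    """BIC = -2 log L + k ln n for a Gaussian model with k parameters."""
    n = len(y)
    sigma2 = np.mean((y - y_hat) ** 2)  # ML estimate of the error variance
    neg2_loglik = n * (np.log(2 * np.pi * sigma2) + 1)
    return neg2_loglik + k * np.log(n)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 200)  # the true relation is linear

# Reduced model: intercept + slope (k = 3, counting the variance parameter)
X1 = np.column_stack([np.ones_like(x), x])
b1, *_ = np.linalg.lstsq(X1, y, rcond=None)
bic_reduced = gaussian_bic(y, X1 @ b1, k=3)

# Reference model: adds a quadratic term the data do not need (k = 4)
X2 = np.column_stack([np.ones_like(x), x, x ** 2])
b2, *_ = np.linalg.lstsq(X2, y, rcond=None)
bic_full = gaussian_bic(y, X2 @ b2, k=4)

# The extra term costs ln(n) in the penalty; unless it buys at least that
# much reduction in -2 log L, the reduced model has the lower (better) BIC.
print(bic_reduced, bic_full)
```

The sketch makes the "information cost" concrete: each added parameter raises the penalty by ln(n), so an added term must improve the fit by more than that amount to lower the BIC.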

The use of likelihood ratio tests is an analogous but not identical approach in that it examines differences between -2 log likelihood statistics between two models at a time, while accounting for the difference in the numbers of terms in each.  However, this approach differs in that one of the models, having fewer terms, must be "nested" within the second model, which has more terms. That is to say, all terms in the smaller "reduced" model must be a subset of those in the larger "reference" model. In addition, the LRT approach "favors" the "reduced model" over the "reference model" in that the null hypothesis for the test is that the additional parameters in the reference model do not provide an improvement in fit. Thus, the additional terms will be retained only if the improvement in fit is large enough to lead to rejection of the null hypothesis. 
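The nested-model comparison just described can be sketched as follows; the -2 log likelihood values are hypothetical, not figures from the report:

```python
import numpy as np
from scipy import stats

def lrt_pvalue(neg2ll_reduced, neg2ll_reference, df_extra):
    """Likelihood-ratio test for nested models: H0 says the reference
    model's extra terms give no improvement in fit, so the difference in
    -2 log L is referred to a chi-square with df_extra degrees of freedom."""
    lr_stat = neg2ll_reduced - neg2ll_reference  # >= 0 for nested fits
    return stats.chi2.sf(lr_stat, df_extra)

# Hypothetical -2 log likelihoods: a reduced model nested within a
# reference model carrying 2 additional interaction terms.
p = lrt_pvalue(neg2ll_reduced=2865.4, neg2ll_reference=2857.1, df_extra=2)
print(round(p, 3))  # lr_stat = 8.3 on 2 d.f. -> p ~ 0.016: terms retained
```

Note how the test "favors" the reduced model: the additional terms survive only when the improvement in -2 log L is large enough to reject the null hypothesis.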

In any case, in revising results for the final report, we applied BIC as a screening tool to aid in identifying relatively small sets of good-fitting candidate models, after which we employed statistical tests to select a "best fit" from among the leading candidates.  In this approach we compensate for the selection of a single screening criterion and also can effectively test the significance of small differences in the screening criterion.
      
 The authors appear to have taken great care to ensure that models are as representative as possible, and to avoid overly complex models or overfitting. The handling of "nondetects" appears reasonable. There is rigorous treatment and discussion surrounding background measurements, analyzer drift, limits of quantitation, censoring of data, etc. Nonetheless, the complexity of the problem makes it difficult to see whether all of the objectives were achieved, or whether some of the apparent results are merely artifacts of an intricate math problem. As an example, all emissions data are shown by vehicle and include all tests on all fuels, to show the range of measurements for a given pollutant, by vehicle. These are very informative charts. However, no data are shown to demonstrate the test-to-test repeatability for a given vehicle with a given fuel. Several figures (such as Figures 13-21) show averages, but would be more informative if range bars were shown to indicate max and min or perhaps interquartile range, or perhaps scatterplots with all data points.
      



RESPONSE:    
Plots such as Figure 12 do in fact show all data points including replicates for individual vehicles on specific fuels. Any two or more data points aligned vertically on the plot represent replicate measurements on the same fuel. Presentation of all results using a common log scale allows the variability among the low values to be visualized, as well as that among high values.

      
[3] Authors discuss taking care to not overfit or model random scatter (which is good), in fact stating that the Bag 3 models may be more vulnerable to measurement error due to the extremely low emissions, and opting not to report model coefficients for Bag 3 results. The reviewer agrees with and commends this decision. Handling of extremely low measurements is difficult, and the authors discuss this issue extensively and convincingly. However, when it comes to the PM emissions measurement, there could be additional discussion and justification. For instance, one issue that deserves additional discussion is the lack of tunnel blanks for background PM. Data from a few example tunnel blanks could provide convincing evidence of the zero background assumption. Furthermore, control data should be shown, or at least cited, to support the assumed level of error (+- 1 μg) in the PM filter weight measurements. The authors mention discussions with EPA staff as the basis for the assumed level of error. As the authors know, PM measurement is very sensitive to measurement precision and accuracy, with temperature, humidity, buoyancy effects, static charge and other factors greatly influencing results. In SAE 2005-01-0193, the authors detail exhaustive measures to attain an accuracy of better than +- 1 μg in their filter weights. Please provide evidence of the stated PM measurement error.
      
RESPONSE:
We have added information and text to the report to address this comment.
      
[4] In Section 7, mean residuals are plotted, presumably to demonstrate the quality of the models. It would be interesting to see range bars on the data points to show the range of individual residuals.
      
RESPONSE:  
As mentioned in the text on this topic, we averaged the residuals of the linear-effects model to allow examination of trends in the means that could indicate the presence of interaction effects. In this context, presenting individual residuals would obscure the underlying patterns.

      
 [5] (p. 9) This approach was followed for several reasons: (1) the candidate fuel effected identified for study were selected because we anticipated that they could be important for one or more emissions.  Not clear.  Candidate fuels?  Or candidate fuel effects?
      
RESPONSE:   
The statement is unclear because it should read "candidate fuel effects" rather than "candidate fuel effected." We have modified the text accordingly.
      
[6] (p. 13) Ethanol:  taken in isolation, the models indicate that increasing ethanol is associated with increases in all emissions, both for cold-start and hot-running emissions. The sole exception to the pattern is CO, for which the response to fuel properties appears to change between start and running. The effects are strongest for PM, NOx and NMOG, although presumably, the underlying physical processes could vary.  Interestingly, not consistent with V1 (ORNL/TM-2008/117) or V4 (ORNL/TM-2008/234).  Increasing ethanol (in splash blends with certification gasoline) decreased NMHC and THC (FID_HC), while NMOG was relatively flat.  Ethanol and acetaldehyde increased, of course.  V1 used the LA92 but V4 used the FTP.
      
RESPONSE:  
It is critical to remember, when interpreting coefficients presented throughout the report, that each coefficient represents the effect of a fuel property as though in isolation from the others, that is, only to the extent that a single property can be modified without affecting any of the others. In practice, the extent to which such a modification can be realized is limited. In interpreting and applying the models, then, it is necessary to think in terms of all five properties and their interactions, rather than in terms of a single fuel property without reference to the others.

This rule is important if results obtained from EPAct Phase 3 are to be compared to results from studies that measured emissions on specific fuel blends, but without benefit of an experimental design to "orthogonalize" the different effects. In such cases it is necessary to apply the models to estimate the net result for all properties representing a given fuel, rather than to attempt to evaluate the effects of individual properties. 

We have revised the text in the final report to clarify these points.

      
[7] (p. 50) At the outset, it is helpful to get an overview of the raw results, sorted by vehicle and fuel, which gives an initial impression of variability among vehicles and fuels, as well as within vehicles.  Agreed.  Also would be helpful to show variability of individual vehicles on individual fuels.
      
[8] (p. 51) In the plot for etOHxT50 (Figure 18), the view seems to indicate an upward trend from E0 through E20, but with some downward curvature above E10.  Show all data?  How much scatter is there in NOx for these cases?

RESPONSE:  
For purposes of these figures, means alone were plotted to allow examination of trends in the data. There is a fair amount of scatter in the individual measurements, as shown in other presentations, such as Figure 12.
      
[9] (p. 52) At first glance, the trend appears to "zig-zag," from low to high.  Test to test variation of Bag 1 NOx in V1 study (LA92 cycle) ranged from 10% to over 100% (same veh, same fuel).
In V4 program (FTP, not LA92), range of weighted composite NOx approached 100% in some cases.  How do you ensure that coincidental test-to-test variation is not erroneously attributed to a fuel effect?

RESPONSE: 

We expect that test-to-test variability in the EPAct data is similar to that in V1 and V4. Test-to-test variability is always a potential factor, but we think it more probable that the effect is due to characteristics of the fuel matrix than to test-to-test variability as such. As the discussion in the rest of the paragraph shows, we suggest that not all points in the graph merit equal weight in assessing the presence or absence of a trend. In Figure 19, if the center green point, leftmost red point and center black point are discounted somewhat, the figure is suggestive of a positive trend for T50, which is visible, but not as large as that for aromatics, shown in Figure 21. The discussion in Section 4 is prospective, based on an initial review of the data prior to model fitting. In this case, the model for Bag 1 NOx shown in 8.2.1 (Table 65) has a positive coefficient for T50, confirming the initial impression.

      
[10] (p. 76) [Bag-1 PM vs. ethanol level]  The Linear Effects plot for ethanol shows some mixed results (Figure 33), but with an apparent increase from 0% to 10% ethanol, followed by a leveling or decline at higher ethanol levels.  Decline would be expectation.  Increase from E0 to E10 is odd.
      
RESPONSE:   
The figure illustrates what averaging the data shows in this case, with the results "orthogonalized" across the other effects. After reviewing Figures 34, 37 and associated interactions, and after model fitting, it is clear that balancing across aromatics and T90 is particularly important. Studies based on sets of fuels not similarly balanced with respect to these properties may give apparently different results.
      
[11] (p. 93) In this initial step, censored measurements were replaced with the minimum positive value measured for the emission and bag.   Minimum across all vehicles and fuels?
      
RESPONSE:  
Correct.  The values used were the minima across all vehicles and fuels.
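The substitution rule can be illustrated with a minimal sketch; the measurement values are hypothetical, not program data:

```python
import numpy as np

# Hypothetical bag measurements for one emission; NaN marks a censored
# (non-detect) result.
pm = np.array([0.42, np.nan, 1.8, 0.07, np.nan, 3.1])

# Substitute the minimum positive measured value, taken across all
# vehicles and fuels for the emission and bag, for the censored entries.
observed = pm[~np.isnan(pm)]
min_positive = observed[observed > 0].min()
filled = np.where(np.isnan(pm), min_positive, pm)
print(filled.tolist())  # censored entries replaced by 0.07
```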
      
      
[12] (p. 100) In this case, the BIC declines steadily as terms are removed, indicating an improvement in fit for each successive reduced model.   Declining steadily by 0.1 to 0.2% does not seem significant or important.  Is the improvement cited below truly due to the simpler model?  Perhaps, but is the BIC really an accurate indicator?

      
[13] (p. 101) Table 20.  Model Fitting History for PM, Bag 1 (FM9 selected as best-fit model).  Is the sensitivity of BIC such that there is a significant difference between 2862 and 2867?  Six significant figures and 0.1% change in BIC does not seem important.  If this is important or significant, it should be explained. Perhaps the important point is that BIC is NOT INCREASING as the model is simplified, thus justifying the simpler model?
      
RESPONSE (to comments 12 and 13):
We agree that it is difficult to assess the importance of small differences in the BIC. In this case, the final model selection was based on the likelihood-ratio tests.
      
[14] (p. 120) In this program, dilution air was HEPA-filtered and presumed to be free of PM, so there was no background filter sample collected for later subtraction as is typical with other emissions.  Why no tunnel blanks to prove this assumption?

RESPONSE: 
We have added text and information to the report to address this comment.

      
[15] (p. 120) Discussion with EPA staff experienced with PM measurement suggests that for the data as collected in this program a variability of +-1 μg should be applied to all filter weights.  Considering that the net PM result is calculated by subtracting two filter weights (average dirty minus average clean), it should be understood to have a variability range of +-2 μg, as the measurement error applies to both weights.  Therefore, a net weight gain of 10 μg would have a relative error of 20% associated with it, a figure of the same order of magnitude as the fuel effects this program attempts to capture.  Need to cite reference(s) or present data to establish this level of error.  For example 2005-01-0193.

RESPONSE:
We have added text and information to the report to address this comment.
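The arithmetic the reviewer describes can be made explicit in a short sketch; the +-1 ug per-weighing figure is the assumption stated in the comment:

```python
def net_pm_relative_error(net_gain_ug, per_weighing_error_ug=1.0):
    """Worst-case relative error for a net PM mass computed as the
    difference of two filter weighings, each carrying +-1 ug."""
    net_error_ug = 2 * per_weighing_error_ug  # dirty minus clean weighing
    return net_error_ug / net_gain_ug

print(net_pm_relative_error(10.0))  # 0.2, i.e. 20% on a 10 ug net gain
```

As the comment notes, this worst-case relative error is of the same order as the fuel effects the program attempts to capture for low filter loadings.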
      
      
[16] (p. 149) Interaction plots for selected terms are shown in Figure 65.  Y axis title is "mean measurement."  Isn't this a modeled result, not a measured result?  

RESPONSE:
No.  In Figure 65 (Figure 87 in the final report), the results are averages of data, not predicted results. In Figures 66(88) and 67 (89), the values plotted are means of residuals, calculated as measured minus predicted.
      
[17] (p. 149) Another way of viewing the interactions is to average and plot the residuals of the linear effects model.  Compare modeled results to measured results.  Mean is good, but would also be nice to see the range of residuals.  Mean can be close to zero, but how much variation is there? (average of + 20 and -20 is zero).

      
[18] (p. 161) Figure 69.  NOx (Bag 1): Mean Residuals for the Linear Effects Model, vs. Target Fuel Properties for four pairs of terms: (a) Ethanol x Aromatics, (b) Aromatics x ethanol, (c) Aromatics x T90, (d) T90 x Aromatics, (e) RVP x T90, (f) T90 x RVP, (g) ethanol x T50, (h) T50 x etOH.  Same comment as above.  Mean residuals look good.  What is range of individual residuals?  Perhaps add error bars to show min and max or perhaps interquartile range of residuals?
      
RESPONSE (to comments 17 and 18):   
Figures 65-67 (Figures 87-89 in the final report) present results of an analysis designed to detect the presence of interaction terms by examining patterns in residuals from a model including only linear effects (with no interactions). In this presentation, a trend across the zero line, whether positive or negative, suggests the possibility that an interaction exists. The residuals were averaged to more clearly show trends. 
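This diagnostic can be sketched on simulated data; the property levels, coefficients, and the omitted ethanol x aromatics interaction below are hypothetical, not the study's values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 240
# Hypothetical design: two fuel properties, with a true interaction term
etoh = rng.choice([0.0, 10.0, 15.0, 20.0], n)
arom = rng.choice([15.0, 25.0, 35.0], n)
y = (1.0 + 0.05 * etoh + 0.02 * arom + 0.004 * etoh * arom
     + rng.normal(0, 0.2, n))

# Fit linear effects only (no interaction), then average the residuals
X = np.column_stack([np.ones(n), etoh, arom])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

means = (pd.DataFrame({"etoh": etoh, "arom": arom, "resid": resid})
         .groupby(["etoh", "arom"])["resid"].mean())
# Cell means that trend across the zero line as the properties change
# point to an interaction omitted from the linear-effects model.
print(means.round(2))
```

Averaging within cells suppresses the measurement noise, which is why trends in the means are easier to see than in the individual residuals.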

      
[19] (p. 178) This question does not arise in the context of model application, but rather with respect to model validation, in that the datasets available for validation include data that represent emissions from pre Tier-2 vehicles.  Such as the DOE V1 and/or V4 datasets?  Have the models been compared/validated against V1 or V4 data?  Note that V1 used splash blends and ran the LA92 cycle.  V4 tests also used splash blends but ran the FTP.

RESPONSE:
When comparing model predictions to results measured on other sets of vehicles, particularly pre Tier-2 vehicles, it is necessary to recognize that emission levels, and hence intercept terms, will vary widely and can confound comparisons unless appropriately accounted for.  Attempts to verify the models using results from V1 or V4 are complicated by this issue. We are not aware of other suitable datasets at this time. As a result, model verification is an area for future work. 
      
      
[20] (pp. 180-81) Thus, despite much lower overall emission levels that have been achieved in Tier 2 vehicles through improved fuel control and catalyst efficiency, the effect of ethanol on combustion and aftertreatment(?) appears to persist in certain modes of operation such as cold-starts and transients during warmed-up operation.  Effect of ethanol on NOx emissions may be related to exhaust stoichiometry and catalyst efficiency more so than (or in addition to) changes in engine-out NOx.  

RESPONSE:    
This program addressed effects of the fuel properties on emissions measured at the tailpipe, thus including any effects on combustion and aftertreatment.  To avoid the impression that we are attempting to interpret them separately, we have modified the text to read: "the effect of ethanol on tailpipe emissions appears to persist ..."
      
[21] (p. 181) (Should we address effect on THC/NMHC as well?)  Yes.  Note that NMOG in V4 was not affected by ethanol, but FID_HC and NMHC decreased (with splash blends).
      
[22] (p. 181) In the present study NMOG decreased with decreasing aromatics content, in agreement with earlier studies.  Or increased with increasing aromatics...

RESPONSE: 
To clarify, we have rephrased this sentence to read:  
"... NMOG has a positive coefficient, indicating that NMOG increases with increasing aromatics, or decreases with decreasing aromatics, in agreement with previous studies."
      
[23] (p. 199) With respect to censoring, the following rule was applied.  If the number of censored measurements was <= 5, we substituted the smallest measured positive value for the missing values, and proceeded with model fitting, using a mixed-model approach.  Why 5?  Why not 4 or 6 or 10?  Explain.

      
[24] (p. 199) However, if the number of censored measurements was > 5, we fit a model using Tobit regression (i.e., "censored normal regression"), an established technique for analysis of left-censored datasets.  Ditto.  What is significance of 5?
      
RESPONSE (to comments 23 and 24):
See response above, under Item 2.  
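A minimal sketch of the Tobit ("censored normal") regression the rule refers to, fit by maximum likelihood on simulated data (the design and parameter values are hypothetical):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_tobit(x, y, c=0.0):
    """Left-censored Tobit: y* = b0 + b1*x + e, e ~ N(0, s^2); we observe
    y = max(y*, c), so values at c contribute P(y* <= c) to the likelihood."""
    cens = y <= c
    def neg_loglik(theta):
        b0, b1, log_s = theta
        s = np.exp(log_s)  # parameterize by log(s) to keep s positive
        mu = b0 + b1 * x
        ll = norm.logpdf(y[~cens], mu[~cens], s).sum()
        ll += norm.logcdf((c - mu[cens]) / s).sum()
        return -ll
    res = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
    b0, b1, log_s = res.x
    return b0, b1, np.exp(log_s)

# Simulated check: truth b0 = 0.5, b1 = 2.0, s = 1.0, censored below 0
rng = np.random.default_rng(42)
x = rng.uniform(-2, 2, 500)
y = np.maximum(0.5 + 2.0 * x + rng.normal(0, 1.0, 500), 0.0)
b0, b1, s = fit_tobit(x, y)
# b0, b1, s should land near the true values despite heavy censoring
```

The key point is that the censored observations enter the likelihood through the normal CDF rather than being replaced by a substituted value, which is why Tobit is preferred when the number of censored measurements is large.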
      
[25] (p. 203) Based on these results, the reduced model FM7 was selected as the best fit.  As noted in previous BIC discussion, do small variations in BIC truly indicate a difference in fit?  FM5, FM6, and FM7 all have similar BIC.
      
RESPONSE:
In this analysis, the "best fit" models were selected on the basis of the likelihood-ratio tests, with the BIC presented as corroborating information. 
               
3.2	General Comments

Two of the reviewers provided general comments on the EPAct Study Analysis.  Among these general comments were evaluations of the report's strengths, suggestions for improving and strengthening certain of its elements, and queries for further information.
      
      Lindhjem:  
      
[1] The report . . . presents a well‐documented approach to estimating the effect that gasoline fuel properties have on late model vehicle emissions. In general, the approach to evaluating the data is equal to or more robust than previous efforts, such as the complex and predictive models or other fuel effects studies.
      
[2] I was unable to find the earlier reports "EPAct/V2/E‐89" referenced in the document, presumably "Assessing the Effect of Five Gasoline Properties on Exhaust Emissions from Light‐ Duty Vehicles certified to Tier‐2 Standards: Part I ‐ Study Design and Execution (EPAct/V2/E‐89Test Program Final Report)". This report likely describes in better detail why the fuel blending program could not produce an orthogonal fuel matrix. Likewise, questions about the measurement, vehicle conditioning, and other issues are not addressed in Part II. This makes it difficult to assess the relevance of the modeling.


RESPONSE: 
The Testing Report does add detail but nothing essential as to why the fuel matrix could not be entirely orthogonal. The essential point is that ethanol content and T50 cannot be manipulated independently outside a relatively narrow range. Most importantly, fuels with high ethanol (>15%) and high T50 (>170 °F) cannot be blended. 


      
(West):  

[1] Models should be validated against independent datasets, as discussed in section 8.
      
RESPONSE:
See our earlier response regarding model verification.
      
      
[2] (p.8) The program was conducted in three phases. Phases 1 and 2 were pilot efforts involving measurements on 19 light-duty cars and trucks on three fuels, at two temperatures.  This work was completed at Southwest Research Institute between September 2007 and January 2009.  Have these results been published?


RESPONSE:  
No, these results had not been published at the time this review commenced. However, since that time, these efforts have been documented and will be available in the public docket for the Tier 3 NPRM.  As mentioned, these early Phases were pilot efforts intended primarily to assess the magnitude of fuel effects to be expected and to finalize the testing and analytical procedures to be used in the main effort (Phase 3).  In addition, irregularities in blending the fuels used in the pilot phases complicated the interpretation of results obtained.
      
      
[3] (p. 16) This report describes the analysis of the dataset collected in Phase 3 of the EPAct/V2/E-89 program, conducted at Southwest Research Institute in San Antonio, Texas.  A separate report describing the program design and data collection activities is available, but an overview is provided below.  This report does not appear to be publicly available at this time (January 2012)

RESPONSE:   
Unfortunately, the report was not available at the time of the review.
      
[4] (p. 18) An initial sample of 19 test vehicles was chosen with the intent of representing of the latest-technology light duty vehicles being sold at the time the program was being launched (model year 2008).  In terms of regulatory standards, the test sample was to conform on average to Tier 2 Bin 5 exhaust levels and employ a variety of emission control technologies, to be achieved by including a range of vehicle sizes and manufacturers.  I recall EPA staff indicating (c. 2007) that the list of vehicles was a projection of future technology and/or engine families
      
RESPONSE:   
At the time the vehicles were being acquired, EPA staff were aware that changes in some engine families had been recently made by the manufacturers, or were imminent.  This development influenced specific selections to some extent.  However, the sample did not include technologies such as direct injection, turbocharging, hybrid drivetrains, etc.  The scope of the study was to produce a thorough characterization of emissions from vehicles employing currently marketed technologies and certified to Tier 2 standards.
      
      
[5] (p. 20) After some consideration, study participants agreed to rely on the aggregate data, while applying appropriate techniques to address the resulting "censoring" of the data at the low end of the range of values.  Who are the "study participants?"
      
RESPONSE: 
The study participants are listed at the beginning of the introduction. It would be redundant to list them again at this point in the text.
      
      
[7] (p. 32) Table 8 shows that the combination of one- and two-stage standardization neutralizes the remaining correlations, with the exception of that between the etOH and T50 linear effects, as previously described.  Unclear "that between the etOH and T50 linear effects (what)"
      
RESPONSE:
We have modified the text to read:  " ... with the exception of the correlation remaining between the linear effects for ethanol and T50."
      
      
[8] (p. 36) For these reasons, the decision was made not to replace zeros in the dilute bag dataset with integrated continuous measurements.  Agree with this decision.
      
[9] (pp. 36-37) Based on strong correlations between these species, we developed statistical models to impute NMOG and NMHC from corresponding NMHCFID measurements.  See ORNL/TM-2011-461
      
RESPONSE: 
We agree that our results are similar to those described in the ORNL report. Accordingly, we have referenced and cited the ORNL report in the final report.

 [10] (p. 51) Trends for individual vehicles show a general increase in NOx with increasing ethanol, with some exceptions.  Consistent with prior studies ORNL/TM-2011/234; SAE 2009-01-2723

      
[11] (p. 91) We assume that a very small but positive measurement existed but was not captured and quantified.  Assigning a value of zero to these observations is an example of a common approach to censoring of observations, known as "substitution."  In this approach, a small but fixed quantity is substituted for the censored observations.  Values used for substitution include zero, as mentioned, or small but positive quantities such as the smallest observation, a multiple of the smallest observation, the limit of quantitation (LOQ) or half the limit of quantitation (LOQ/2). The degree of censoring varied widely by emission and bag, as shown in Table 15.   At different stages of the analysis, we addressed censoring in different ways.  These various approaches all make sense.  Suggest adding a couple of sentences about the various bags of the LA92.  Bag 1 is cold with the majority of emissions, hence fewer censored values.  Bag 2 is hot, but includes some open-loop operation due to hard acceleration.  Bag 3 is the hot-start bag with very low emissions, hence the majority of censored values.
      
RESPONSE:
We have added a paragraph describing the three bags of the LA92 and their relationships to levels of censoring observed.
      
[12] (p. 92) Table 15.  Numbers of Censored Measurements, by Emission and Bag.  Suggest showing total number of tests (in title or footnote).

RESPONSE:
We have added the total number of measurements to the table.
      
      
[13] (p. 94) Table 17.  Counts of Influential Measurements, by Emission and Bag (with "influential" defined as having a studentized-deleted residual >= 3.5 or <= -3.5).  Suggest including total number of measurements or observations

RESPONSE:
We have added the total number of measurements to the table.
      
[14] (p. 95) An additional measurement in Bag 3 (run 6281) was removed, even though it was not flagged as influential.  Explain further?

RESPONSE:   
We have added text and a figure explaining the reason for removing this measurement. Specifically, the measurement for this run was higher than all other measurements for the vehicle by a wide margin, including the replicate measurement on the same fuel.
      
      
[15] (p. 95) The full sets of terms in the optimized design include terms anticipated to be meaningful for any of the emissions to be measured.  However, it was not anticipated that all the terms included would necessarily be meaningful for all emissions in all bags. A closely related goal is to develop models that would be, to the extent possible, explicable in terms of knowledge of the relevant physical and chemical processes.  Parsimonious models are preferred over full models for this purpose, as their simpler structure makes their behavior easier to assess and explain.  Finally, with respect to explicability, it is much preferred to minimize the potential for overfitting, which could reduce the generality of models selected for prediction.  To guide the process, we adopted several assumptions, described below.  Model describes the random error rather than the desired underlying relationship.  Good discussion.
      
[16] (p. 96) For minimal levels of censoring, defined as five or fewer censored measurements (ncensored <= 5), we elected to substitute the minimum positive measured value for the missing measurements.  After substitution we fit mixed models as described above.  N<5 seems reasonable if the total number of samples is large.   Suggest noting Ntotal here.  Why 5? That is, why not 4 or 6 or 10?  Is 5 arbitrary or is there a citable reference to why 5?

RESPONSE:   
See response above on this item.
      
[17] (p. 107) The approach to analysis of censored measurements, as described in 5.3.2, was also adopted based on guidance from the author of the DOE research.  No public documents can be found on the EPAct/V2 study.
      
RESPONSE:   
This report was unavailable for public release at the time this review was initiated.

[18] (p. 116) If this program were simply trying to quantify the magnitude of NOx emissions from such vehicles, this level of error may be acceptable.  However, since we are looking for meaningful differences in emissions between fuels, this large relative error is particularly problematic.  Key point.  Meaningful differences.
      
[19] (p. 117) This suggests it likely has higher measurement noise than data from the other vehicles, and thus many measurements may not be reliably distinguishable from background levels.  Good
      
[20] (p. 149) It is interesting to note that while (d) and (f) appear somewhat similar visually, the model considers the aromxT90 interaction highly significant but the RVPxT90 interaction insignificant.  Explain.



RESPONSE (referring to Figure 87 in the final report):   
The statement simply points out that while subplots (d) and (f) show broadly similar patterns in trends for different aromatics levels, the interaction shown in (d) (aromxT90) is found to be statistically significant in model fitting, whereas that in subplot (f) (RVPxT90) is not found to be significant. This result probably follows from the fact that the trends cross more conspicuously in subplot (d) than in subplot (f).
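The point that visually similar trend plots can differ sharply in statistical significance can be illustrated with a small simulation.  The data below are entirely synthetic (not the study's measurements), and the coefficient values are hypothetical; the sketch simply shows how a t-statistic for an interaction term is computed in an ordinary least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
arom, t90 = rng.normal(size=(2, n))

# Simulated emissions with a genuine arom x T90 interaction (coefficient 0.4)
y = 1.0 + 0.5 * arom + 0.3 * t90 + 0.4 * arom * t90 \
    + rng.normal(scale=0.5, size=n)

# OLS fit of main effects plus the interaction term
X = np.column_stack([np.ones(n), arom, t90, arom * t90])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X).diagonal())

# A large |t| marks the interaction as statistically significant even
# when trend plots for two interactions look broadly similar to the eye
t_interaction = beta[3] / se[3]
print(f"interaction t-statistic: {t_interaction:.1f}")
```

Crossing trends, as in subplot (d), translate into a nonzero interaction coefficient that is large relative to its standard error, which is what the significance test detects.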
      
      
(p. 150) < insert physical interpretation of these interactions here?>  Yes
      
      
[21] (p. 165) One is that the magnitudes of corresponding coefficients are generally larger for Bag 1 than for Bag 2 emissions, suggesting that the effects of fuel properties are more pronounced for "cold start" than for "hot running" emissions.  As expected!
      
[22] (p. 169) The reasons for these differences are not apparent, but it is clear that the relations among NOx, etOH and T50 are complex.   Or nonexistent?


RESPONSE: 
When model fitting for the Bag-1 data is based on the 11-term design model, results suggest the existence of a relatively small effect for T50. This result is, of course, consistent with the impression given by a simple averaging of the data (Figure 68 subplot (h)). Nonetheless, the results for T50 are less certain than for the ethanol and aromatics effects.
      
[23] (p. 169) It may be appropriate to consider whether the Bag 3 results may be more vulnerable to measurement error attributable to low sample measurements relative to background, given the issues with measurement discussed in 6.1.1 (page 116).  Yes
      
[24] (p. 183) Consider showing quantitative parameter changes and percent change results for the models?  Good idea

RESPONSE:
Investigating the possibility of portraying model coefficients as percent changes in emissions per percent change in fuel properties showed that, in general, the coefficients cannot be validly portrayed in this manner without reference to specific fuels.
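The difficulty can be seen with a toy linear model (the coefficients below are hypothetical, not the report's fitted values): the elasticity, i.e., the percent change in predicted emissions per percent change in a fuel property, depends on the property level at which it is evaluated, so no single percent-change figure applies across fuels.

```python
def elasticity(b0, b1, x):
    """Percent change in predicted emissions per percent change in fuel
    property x, for a simple linear model y = b0 + b1 * x.  The ratio
    depends on x, so it cannot be quoted without fixing a reference fuel."""
    y = b0 + b1 * x
    return b1 * x / y

# Hypothetical coefficients: the same model gives different elasticities
# at two different fuel-property levels
print(round(elasticity(1.0, 0.05, 10.0), 3))  # property at 10 units -> 0.333
print(round(elasticity(1.0, 0.05, 30.0), 3))  # property at 30 units -> 0.6
```

This is why the response notes that percent-change summaries require reference to specific fuels rather than being a property of the coefficients alone.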
      
[25] (p. 211) Detailed results, including models fit, fitting histories, coefficients and tests of effect are presented in Appendices Q.3-W.3 for Bag 1 models, and Q.4-W.4 for Bag 2 models.  Not available for review

RESPONSE:
The material in the Appendices parallels that presented in the body of the report, but covers the remaining compounds.
      
