
Technical Assistance


What are Outcomes and How Are They Used?

Outcomes are benefits or changes for individuals or populations during or after participating in program activities. They are influenced by a program's outputs. Outcomes may relate to behavior, skills, knowledge, attitudes, values, condition, or other attributes. They are what participants know, think, or can do; or how they behave; or what their condition is, that is different following the program.

What are the different levels of OUTCOMES?
  • Short-term outcomes
    • Knowledge and skills
  • Intermediate outcomes
    • Behaviors
  • Longer-term outcomes
    • Values/beliefs, conditions and status

Why Measure OUTCOMES? Funding for nonprofits is decreasing while community needs are increasing. An outcome evaluation can look at the impacts and benefits to clients during and after participation in your programs.


Evaluation and Measurement

As we all know, research is an important part of developing and maintaining an effective treatment program. This manual is designed not only to explore various assessment tools but also to introduce or refresh your knowledge about the research process, and more specifically the evaluation research process. Such an evaluation can be conducted on programs, employees, or clients. Evaluation can look at specific questions such as "How does our program impact the recidivism of adolescent females who have substance abuse problems?" or more general questions such as "What type of social activities do the clients in our program enjoy?"

Evaluation can be a threatening and uncomfortable process for some people. Many groups and organizations struggle with how to build a good evaluation capability into their everyday activities and procedures. Most agencies have incorporated research into their quality assurance program. We will talk more about this process later.

Evaluation is a methodological area that is closely related to, but distinguishable from, more traditional social research. Evaluation utilizes many of the same methodologies used in traditional social research, but because evaluation takes place within an agency such as ours, it requires group skills, management ability, political dexterity, sensitivity to multiple stakeholders and other skills that social research in general does not require. The following is a discussion of the major terms and issues in the field.

What is Evaluation?
Probably the most frequently given definition is: the systematic acquisition and assessment of information to provide useful feedback about some object (this could be a program, policy, technology, person, need, activity, etc.). Evaluation work involves collecting and sifting through data, making judgments about that data and inferring the results of that data to a program or process.

Goals of Evaluation
The generic goal of most evaluations is to provide "useful feedback" to a variety of audiences including sponsors, donors, clients, groups, administrators, staff, and other relevant constituencies. Most often, feedback is perceived as "useful" if it aids in decision-making. But the relationship between an evaluation and its impact is not a simple one. Studies that seem critical sometimes fail to influence short-term decisions, and studies that initially seem to have no influence can have a delayed impact when more congenial conditions arise. Despite this, there is broad consensus that the major goal of evaluation should be to influence decision-making or policy formulation through the provision of empirically driven feedback.

Types of Evaluation
There are many different types of evaluations depending on the object being evaluated and the purpose of the evaluation. Perhaps the most important basic distinction in evaluation types is that between formative and summative evaluation. Formative evaluations strengthen or improve the object being evaluated -- they help form it by examining the delivery of the program or technology, the quality of its implementation and the assessment of the organizational context, personnel, procedures, inputs, and so on. Summative evaluations, in contrast, examine the effects or outcomes of some object -- they summarize it by describing what happens subsequent to delivery of the program or technology; assessing whether the object can be said to have caused the outcome; determining the overall impact of the causal factor beyond only the immediate target outcomes and estimating the relative costs associated with the object.

Formative evaluation includes several evaluation types:
  • needs assessment determines who needs the program, how great the need is, and what might work to meet the need
  • evaluability assessment determines whether an evaluation is feasible and how stakeholders can help shape its usefulness
  • structured conceptualization helps stakeholders define the program or technology, the target population, and the possible outcomes
  • implementation evaluation monitors the fidelity of the program or technology delivery
  • process evaluation investigates the process of delivering the program or technology, including alternative delivery procedures

Summative evaluation can also be subdivided:
  • outcome evaluations investigate whether the program or technology caused demonstrable effects on specifically defined target outcomes
  • impact evaluation is broader and assesses the overall or net effects -- intended or unintended -- of the program or technology as a whole
  • cost-effectiveness and cost-benefit analysis address questions of efficiency by standardizing outcomes in terms of their dollar costs and values
  • secondary analysis re-examines existing data to address new questions or use methods not previously employed
  • meta-analysis integrates the outcome estimates from multiple studies to arrive at an overall or summary judgment on an evaluation question
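
The efficiency questions mentioned under cost-effectiveness and cost-benefit analysis come down to simple ratios. The sketch below is illustrative only: the function names and all dollar figures are hypothetical, and a real analysis would need a defensible way to count outcomes and to put a dollar value on benefits.

```python
def cost_per_outcome(total_cost, outcomes_achieved):
    """Cost-effectiveness: dollars spent per unit of outcome achieved."""
    if outcomes_achieved <= 0:
        raise ValueError("outcomes_achieved must be positive")
    return total_cost / outcomes_achieved


def benefit_cost_ratio(total_benefit, total_cost):
    """Cost-benefit: dollar value of benefits per dollar of cost."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return total_benefit / total_cost


# Hypothetical figures, for illustration only.
program_cost = 50_000.0        # total program spending in dollars
clients_completing = 40        # clients who reached the target outcome
estimated_benefit = 120_000.0  # assumed dollar value of those outcomes

print(cost_per_outcome(program_cost, clients_completing))    # 1250.0
print(benefit_cost_ratio(estimated_benefit, program_cost))
```

A lower cost per outcome, or a benefit-cost ratio above 1, suggests a more efficient program, but only to the extent that the outcome counts and dollar values are credible.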


Sampling
Sampling is the process of selecting units (e.g., people, organizations) from a population of interest so that by studying the sample we may fairly generalize our results back to the population from which they were chosen.

What is a sample? A sample is a finite part of a statistical population whose properties are studied to gain information about the whole (Webster, 1985). When dealing with people, it can be defined as a set of respondents (people) selected from a larger population for the purpose of a survey.

What is a population? A population is a group of individuals, persons, objects, or items from which samples are taken for measurement. For example, if you were looking at a substance abuse program, the population would be all the clients in that substance abuse program.

What is sampling? Sampling is the act, process, or technique of selecting a suitable sample, or a representative part of a population for the purpose of determining parameters or characteristics of the whole population.

What is the purpose of sampling? To draw conclusions about populations from samples, we must use inferential statistics, which enables us to determine a population's characteristics by directly observing only a portion (or sample) of the population. We obtain a sample rather than a complete enumeration (a census) of the population for many reasons. Obviously, it is cheaper to observe a part rather than the whole, but we should be prepared to cope with the dangers of sampling. In this tutorial, we will investigate various kinds of sampling procedures. Some are better than others but all may yield samples that are inaccurate and unreliable. The dangers can be minimized, but some potential error is the price paid for the convenience and savings that samples provide.

What is the difference between probability (random) and non-probability (non-random) sampling? The difference between non-probability and probability sampling is that non-probability sampling does not involve random selection and probability sampling does. Does that mean that non-probability samples aren't representative of the population? Not necessarily. But it does mean that non-probability samples cannot depend upon the rationale of probability theory. At least with a probability sample, the researchers know the odds or probability that the population is represented. In general, researchers prefer probability or random sampling methods to non-random ones, and consider them to be more accurate and rigorous. However, in applied social research there may be circumstances where it is not feasible, practical or theoretically sensible to do random sampling.

Random Sampling
This may be the most important type of sample. A random sample allows a known probability that each elementary unit will be chosen. For this reason, it is sometimes referred to as a probability sample. This is the type of sampling that is used in lotteries and raffles. For example, if you want to select 10 players randomly from a population of 100, you can write their names on slips of paper, fold them up, mix them thoroughly, then pick ten. In this case, every name has an equal chance of being picked. Random numbers can also be used (see Lapin page 81).
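
The names-in-a-hat procedure can also be done with a computer's random number generator. The following is a minimal sketch, assuming a hypothetical roster of 100 placeholder names; the function name and the seed value are illustrative and not part of any particular procedure described here.

```python
import random

# Hypothetical roster: 100 placeholder player names.
population = [f"Player {i}" for i in range(1, 101)]


def draw_simple_random_sample(units, size, seed=None):
    """Draw `size` distinct units, each with an equal chance of selection."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(units, size)  # sampling without replacement


sample = draw_simple_random_sample(population, 10, seed=2024)
print(sample)
```

Recording the seed lets someone else reproduce the same draw later, which can be useful when the selection has to be documented.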

Non-Random Sampling
Purposeful sampling selects information-rich cases for in-depth study. Size and specific cases depend on the study purpose. They are briefly described below for you to be aware of them. The details can be found in Patton (1990), pp. 169-186.

Sample Size
Using a sample in research saves on money and time. In order to reduce sampling errors the researcher should use a suitable sampling strategy and an appropriate sample size. A sample should yield valid and reliable information. Sample size is symbolized in research articles or reports as the letter "N."

The question of sample size can be a difficult one. Sample size can be determined by various constraints. For example, the available funding may pre-specify the sample size. When research costs are fixed, a useful rule of thumb is to spend about one half of the total amount for data collection and the other half for data analysis. This constraint influences the sample size as well as sample design and data collection procedures. In general, sample size depends on the nature of the analysis to be performed, the desired precision of the estimates one wishes to achieve, the kind and number of comparisons that will be made and the number of variables that have to be examined.

Measurement
Measurement is the process of observing and recording the observations that are collected as part of a research effort. There are two major issues that will be considered here. First, one must understand reliability of measurement, including consideration of true score theory and a variety of reliability estimators. Second, one must understand the different types of measures that you might use in social research. Four broad categories of measurements are usually considered:
  1. Survey research includes the design and implementation of interviews and questionnaires;
  2. Scaling involves consideration of the major methods of developing and implementing a scale;
  3. Qualitative research provides an overview of the broad range of non-numerical measurement approaches; and
  4. An unobtrusive measure presents a variety of measurement methods that don't intrude or interfere with the context of the research.


Reliability
What is Reliability? Reliability is the consistency of your measurement, or the degree to which an instrument measures the same way each time it is used under the same condition with the same subjects. In short, it is the repeatability of your measurement. A measure is considered reliable if a person's score on the same test given twice is similar. It is important to remember that reliability is not measured, it is estimated. There are two ways that reliability is usually estimated: test/retest and internal consistency.

Test/Retest
Test/retest is the more conservative method to estimate reliability. Simply put, the idea behind test/retest is that the score on test 1 should be the same as the score on test 2. The three main components to this method are as follows:
  1. implement your measurement instrument at two separate times for each subject;
  2. compute the correlation between the two separate measurements; and
  3. assume there is no change in the underlying condition (or trait you are trying to measure) between test 1 and test 2.
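
The correlation in step 2 can be sketched as follows. This is an illustration only: it uses the Pearson correlation coefficient as the correlation measure, and the two lists of scores are made-up values for six hypothetical subjects.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from each mean.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for the same six subjects at two administrations.
test_1 = [12, 15, 9, 20, 17, 11]
test_2 = [13, 14, 10, 19, 18, 10]

reliability = pearson_r(test_1, test_2)
print(round(reliability, 3))
```

A value near 1 suggests that subjects kept roughly the same relative standing across the two administrations, provided the assumption in step 3 holds.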

Internal Consistency
Internal consistency estimates reliability by grouping questions in a questionnaire that measure the same concept. For example, you could write two sets of three questions that measure the same concept (say class participation) and after collecting the responses, run a correlation between those two groups of three questions to determine if your instrument is reliably measuring that concept.

One common way of computing correlation values among the questions on an instrument is by using Cronbach's Alpha. In short, Cronbach's alpha splits all the questions on your instrument every possible way and computes correlation values for them all (we use a computer program for this part). In the end, your computer output generates one number for Cronbach's alpha and, just like a correlation coefficient, the closer it is to one, the higher the reliability estimate of your instrument. Cronbach's alpha is a less conservative estimate of reliability than test/retest.
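
For readers curious about what the computer program does, here is a minimal sketch of Cronbach's alpha. It assumes the standard variance-based formula, alpha = k/(k-1) × (1 - sum of item variances / variance of total scores), rather than literally enumerating every split; the response data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row per respondent,
    one column per question)."""
    k = len(responses[0])  # number of questions (items)
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    items = list(zip(*responses))  # regroup the scores by question
    item_variance_sum = sum(variance(col) for col in items)
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical answers from five respondents to three questions
# that are meant to measure the same concept.
answers = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]

alpha = cronbach_alpha(answers)
print(round(alpha, 3))
```

In practice a statistics package such as SPSS reports this value directly; the sketch is only meant to show where the number comes from.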

The primary difference between test/retest and internal consistency estimates of reliability is that test/retest involves two administrations of the measurement instrument, whereas the internal consistency method involves only one administration of that instrument.

Validity
Definition: Validity is the strength of our conclusions, inferences or propositions. More formally, Cook and Campbell (1979) define it as the "best available approximation to the truth or falsity of a given inference, proposition or conclusion." In short, were we right?

Types of Validity:
There are five types of validity commonly examined in social research.
  1. Predictive validity- Does your assessment predict future behavior/attitudes? For example, if a client scores high on a risk assessment instrument, can the researcher predict that the client will behave in a high-risk manner in the future? If so, then the instrument has predictive validity.
  2. Concurrent validity- Does the assessment score concur with other things that go along with behavior or attitudes? For example, if someone scores low on an assessment measuring depression and yet they sleep a lot, have trouble eating, or withdraw from friends and family, then that assessment may not have concurrent validity. The score does not correspond with a majority of their behaviors that are symptoms of depression.
  3. Construct validity is an assessment of how well you translated your ideas or theories into actual programs or measures. Construct validity is actually comprised of two types: Convergent Validity and Discriminant Validity.
    • Convergent validity examines the degree to which the measures are similar to (converges on) other measures that are theoretically similar. For instance, to show the convergent validity of a Head Start program, we might gather evidence that shows that the program is similar to other Head Start programs.
    • Discriminant validity examines the degree to which the measures are not similar to (diverges from) other measures that theoretically are not similar. For instance, to show the discriminant validity of a Head Start program, we might gather evidence that shows that the program is not similar to other early childhood programs that don't label themselves as Head Start programs.
  4. Internal validity asks, if there is a relationship between the program and the outcome we saw, is it a causal relationship? For example, did the attendance policy cause class participation to increase?
  5. External validity refers to our ability to generalize the results of our study to other settings. In our example, could we generalize our results to other classrooms?

Reliability estimates the consistency of your measurement, or more simply, the degree to which an instrument measures the same way each time it is used under the same conditions with the same subjects. Validity, on the other hand, involves the degree to which the measurements are accurate.



Presentations


A presentation was created by the Center for Urban Studies regarding outcomes monitoring. This presentation is available for download.

Download the CDBG Outcomes Training.

Download the CDBG and Performance Measurement Training.


Databases

A list of books and other resources that may be helpful when using SPSS has been compiled and is available for download.

Download the Introduction to SPSS.

A brief SPSS training has also been created and is available for download.

Download the SPSS Training.


Logic Model

Logic models have been an essential tool when developing and monitoring outcomes. A program logic model is a systematic, visual way to present a planned program with its underlying assumptions and theoretical framework. It is a picture of why and how you believe a program will work.

Logic models are tools for program planning, management, and evaluation. They can be used at any point in the evolution of a program and can lead to better programs. Program logic models describe the sequence of events for bringing about change and relate activities to outcomes.

Download the Logic Model Worksheet.

Source: Measuring Program Outcomes: A Practical Approach. United Way of America, 1996.


Assessments

The following pages have various assessment tools. Please note that this is not an exhaustive collection, nor are these tools necessarily the "best" of their category.

Buros Institute
This website allows you to search by topic or by name of assessment for reviews of assessments. It will give you the title, author, purpose of assessment, publisher and publisher's address. You can purchase a full review of each assessment for $15 per assessment. Note that these reviews are descriptions and evaluations of the tests, not the actual tests themselves. To purchase the actual test materials, you will need to contact the test publisher(s).

Chipts
The Center for HIV Identification, Prevention, and Treatment Services (CHIPTS) is a collaboration of researchers who want to enhance the collective understanding of HIV research and to promote early detection, effective prevention, and treatment programs for HIV. This website will allow you to search by topic and (when available) will give you the assessment name, background of the scale (e.g., number of items and what it was designed for), the assessment developer, any copyright information and who holds the copyright, the psychometric measures (i.e., reliability and validity), the actual assessment items, how to score the assessment, and any related references.

Assessment Publishers
Directories of test publishers are included in most major testing reference books (MMY, Tests, TIP). The size and scope of the directory usually reflects how many tests are included in that book. For example, TIP provides brief information on the greatest number of commercially available tests and, thus, has an extensive publisher directory. The Test Collection at Educational Testing Service (ETS) has a free pamphlet entitled Major U. S. Publishers of Standardized Tests, which lists the names, addresses, and phone numbers of 28 major test publishers. Call or write to them for your free copy at ETS, Library, Rosedale Road, Princeton, NJ, 08541, (609) 734-5667.

Assessment References
  • Tests in Print. (TIP) Publisher: The Buros Institute for Mental Measurements, Lincoln, NE. Most current volume: 4th ed. (1994).
  • Mental Measurements Yearbook. (MMY) Publisher: The Buros Institute for Mental Measurements, Lincoln, NE. Most current edition: 13th ed. (1998).
  • Tests. Publisher: Pro-Ed, Inc., Austin, TX. Most current edition: 4th ed. (1997).
  • Test Critiques. Publisher: Pro-Ed, Inc., Austin, TX. Most current edition: updated annually.
  • Directory of Unpublished Experimental Measures. Publisher: William C. Brown Publishers, Dubuque, IA. Editors: Bert A. Goldman & David F. Mitchell. Most current volume: 7 (1997).
  • Measures for Psychological Assessment: A Guide to 3,000 Original Sources and Their Applications. Publisher: Institute for Social Research, Ann Arbor, MI. Editors: K. T. Chun, S. Cobb, & J. R. P. French, Jr. Most current volume: 1975.
Download the General Rating Criteria for Evaluating Scales.
Download Things to Consider When Evaluating an Assessment Tool.


Glossary

Causal relationship - a relationship where variation in one variable causes variation in another.

Concurrent validity - the ability to distinguish between groups that should be theoretically distinguishable.

Content validity - whether or not your instrument reflects the content you are trying to measure.

Convergent validity - measures that should be related are related.

Discriminant validity - measures that should not be related are not.

Correlation - a measure of the association between two variables; a value closer to +1 or -1 means a stronger correlation.

Covariation - a measure of how two variables both vary relative to one another.

Deviation - the difference of a score from the mean.

Error Component - the part of the variance of an observed variable that is due to random measurement errors.

Face Validity - addresses whether or not a measurement instrument is valid on its face.

Hypothesis - a theory or prediction made about the relationship between two variables.

Interaction - when the effect of one variable (or factor) is not the same at each level of the other variable (or factor).

Linear Correlation - a statistical measure of the strength of the relationship between variables (e.g., treatment and outcome). The closer the coefficient is to +1 or -1, the stronger the relationship - a positive correlation implies a direct relationship between the variables, a negative correlation implies an inverse relationship.

Linear Regression - the prediction equation that estimates the value of the outcome variable ("y") for any given treatment variable ("x").

Main Effect - the effect of a factor on the dependent variable (response) measured without regard to other factors in the analysis.

Mean - the average of your sample, computed by taking the sum of the individual scores and dividing them by the total number of individuals (sample size, "n").

Median - if you rank the observations according to size, the median is the observation that divides the list into equal halves.

Mode - the observation that occurs most frequently.

Null Hypothesis - the prediction that there is no relationship between your treatment and your outcome.

Predictive validity - the ability to predict something you want to predict.

Random sample - a sample of a population where each member of the population has an equal chance of being in the sample.

Significance level - the probability of finding a relationship between your treatment and effect when there isn't one in reality.

Type I Error - rejecting the null hypothesis when it is true.

Type II Error - accepting (failing to reject) the null hypothesis when it is false.

Variation - a measure of the spread of the variable, usually used to describe the deviation from a central value (e.g., the mean). Numerically, it's the sum of the squared deviations from the mean.
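
Several of the glossary terms above (mean, median, mode, deviation, and variation) can be computed in a few lines. The sketch below uses a small set of made-up scores and follows the definitions as worded in this glossary, including variation as the sum of squared deviations from the mean.

```python
from statistics import mean, median, mode

# Hypothetical scores for six individuals.
scores = [3, 5, 5, 6, 8, 9]

average = mean(scores)       # sum of the scores divided by n
middle = median(scores)      # value that splits the ranked list in half
most_common = mode(scores)   # score that occurs most frequently

# Deviation: the difference of each score from the mean.
deviations = [score - average for score in scores]

# Variation: the sum of the squared deviations from the mean.
variation = sum(d ** 2 for d in deviations)

print(average, middle, most_common, variation)
```

With an even number of scores, the median is taken as the midpoint of the two middle values, which is why it need not be one of the observed scores.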


Resources

Coming Soon!