ABSTRACT
Survival analysis is a class of statistical methods designed to study the occurrence and timing of events. This study compared the semi-parametric Cox proportional hazards model with two parametric models (Weibull and Gompertz) to determine the model that best fits breast cancer data. The Kaplan-Meier (K-M) method was used to estimate and graph the survival curves using data obtained from Ahmadu Bello University Teaching Hospital, Zaria, on 312 breast cancer patients between 1997 and 2012. The mean age of the patients was 43.39 years, with a standard deviation of 11.74 years, and the overall median survival time was 10 months. The 5-year overall survival rate was about 35.0%. The Akaike Information Criterion (AIC) was used to evaluate the three models. The Weibull regression model had the lowest AIC value (422.60), indicating the best performance in handling the breast cancer data, followed by the Gompertz model (430.28), whereas the Cox regression model had the highest AIC value (530.65). For the Cox, Weibull and Gompertz regression models, age, occupation and stage II of breast cancer did not have a significant effect on the mortality of the patients (p = 0.0440, 0.0270 and 0.1740 respectively), but the result of treatment and stage III of breast cancer had a significant effect on mortality (p = 0.0001 and 0.00001 respectively); p < 0.01 was considered statistically significant. The results of this study showed that, for our breast cancer data, the parametric Weibull regression model could better determine the factors associated with breast cancer than the semi-parametric Cox proportional hazards model. That is, the Weibull model provided a better fit to the study data than the Cox proportional hazards and Gompertz models.
TABLE OF CONTENTS
Cover page
Title page
Declaration
Certification
Acknowledgement
Abstract
Table of contents
List of Tables
List of Figures
CHAPTER ONE: BACKGROUND TO THE STUDY
1.1 Introduction
1.2 Statement of the problem
1.3 Aim of the Study
1.4 Objectives of the study
1.5 History of Cancer Disease
1.5.1 Causes of Cancer Disease
1.5.2 Those at Risk of Cancer
1.5.3 Types of Cancer
1.5.4 Ways to Reduce the Risk of Cancer
1.5.5 Diagnosis of Cancer
1.5.6 Treatment of Cancer
1.6 Breast Cancer
1.7 Basic Concepts in Survival Analysis
1.7.1 Survival Function
1.7.2 Hazard Function
CHAPTER TWO: LITERATURE REVIEW
2.1 Introduction
2.2 Non Parametric Approach
2.3 Comparison of Survival Distribution
2.4 Semi-Parametric Approach
2.5 Parametric Approach
2.5.1 Weibull Regression Model
2.5.2 Gompertz Regression Model
CHAPTER THREE: METHODOLOGY
3.1 Introduction
3.2 Kaplan-Meier Estimator
3.3 Comparison of the Survival Curves of Two Groups Using the Log-rank Test
3.4 Cox Proportional Hazard Model
3.5 Cox Proportional Hazard Model Partial Likelihood Estimation
3.6 Weibull Regression Model
3.7 Gompertz Regression Model
3.8 Akaike Information Criterion (AIC)
CHAPTER FOUR: ANALYSIS AND DISCUSSION OF RESULTS
4.1 Introduction
4.2 Kaplan-Meier (K-M) Survival Function
4.3 Log-rank Tests for Equality of Survival Functions (Survival Probabilities)
4.4 Test for Proportional Hazard Assumption
4.5 Results from the Analysis of Cox Proportional Hazard Model
4.6 Results from the Analysis of Weibull Model
4.7 Results from the Analysis of Gompertz Model
CHAPTER FIVE: SUMMARY, CONCLUSION AND RECOMMENDATIONS
5.1 Summary
5.2 Conclusion
5.3 Recommendations
5.4 Suggestions for Further Research
References
Appendix I
CHAPTER ONE
1.1 Introduction
Survival analysis is a class of statistical methods designed to study the occurrence and timing of events. The methods were developed for studying the time from an initiating event (such as birth, diagnosis of a disease or start of a treatment) to some terminal event (such as response to a treatment, relapse or death). These methods are mostly applied in the medical sciences. However, they have been applied successfully to many different kinds of events across disciplines. Examples include manufacturing and engineering (how long it takes a machine to fail), the social sciences (how long a marriage will last) and finance (the time it takes the price of a stock to drop). Other names are sometimes used for this class of methods, such as “duration analysis”, “failure time analysis”, “event history analysis” and “transition analysis”, but the basic techniques and the underlying ideas are the same, i.e. understanding the pattern of events in time and what factors cause those events to occur.
Survival analysis measures the probability of how long it takes a given outcome to occur for a group of similar individuals; these durations are known as failure times (Cox and Oakes, 1984). Survival analysis also measures the probability that the state in question will last until the next point in time (Gokovali et al., 2007). Survival analysis studies are well known in the medical sciences, particularly in cancer studies. Depending on the question of the study, one is interested either in the risk groups, i.e. which group of patients or components is more likely to experience the event, or in time predictions. The time from a defined starting point to the occurrence of a given event is called survival time (Berwick et al., 2004), and the analysis of such group data is called survival analysis (Altman, 1977).
In order to understand survival analysis, the concept of an event has to be understood. According to Blossfeld et al. (1989), an event is a change in state, as defined by one or more qualitative variables, within some observational period and within the relevant state space. Examples are death, marriage and promotion. Survival analysis is the core of any study of time to a particular event, such as death, infection or diagnosis of a particular disease. It is therefore fundamental to most epidemiological cohort studies. Survival data can be thought of as being generated by what is called a “failure time process”. A failure time process consists of units, such as individuals, governments or countries, that are observed from some starting point in time. Each unit is in some state (for example, the individual is healthy, the government is in power, the country is at peace) and is observed over time. At any given point in time, these units are at ‘risk’ of experiencing some event, where an event essentially represents a change or transition to another state. For example, the individual dies, the government falls from power or the country goes to war. After the event is experienced, the unit is either no longer observed or is at risk of experiencing another kind of event.
In some circumstances, units are not observed to experience an event; that is, no transition is made from one state to another while the unit is being observed. For example, the individual remains healthy, the government remains in power or the country remains at peace. These cases are called ‘censored’, since we do not observe the subsequent history of the unit after the last observation point. Censoring is endemic to survival data. It is defined as the loss of observation on the lifetime variable of interest in the process of an investigation, and it occurs for many reasons. In a clinical trial investigating the effectiveness of a new medical treatment for a particular disease, for example, a patient may be lost to follow-up due to migration or health problems. Censoring occurs when a subject does not have an event during the observation time, meaning that we cannot observe what happens to the subject subsequently. According to Kleinbaum and Klein (2005), there are generally three reasons why censoring may occur: a person does not experience the event before the study ends; a person is lost to follow-up during the study period; or a person withdraws from the study because of death (if death is not the event of interest) or some other reason.
There are three types of censoring: right censoring, left censoring and interval censoring. Right censoring is the most common type; it occurs when observation is terminated before the event occurs, either because the subject leaves the study or because the study ends. For example, consider the patients in a clinical trial studying the effect of treatments on stroke occurrence. If the study ends after five years, those patients who have had no stroke by the end of the five years are said to be right censored. Similarly, in a study of divorce, couples that do not divorce during the observational period are right censored. Left censoring rarely occurs; it happens when an event is known to have occurred before some particular time, but the exact time is unknown. An example is infection with a sexually transmitted disease such as HIV/AIDS, where the infection is known to have occurred before it was detected but the exact time of infection is unknown.
Another example arises in a study of first marriage: if a person is known only to have married before age 20, that person’s marriage age is left censored. Interval censoring occurs when an individual or subject is known to have experienced an event within an interval of time, but the actual survival time is not known. An example is infection with a sexually transmitted disease such as HIV/AIDS under regular (e.g. annual) testing, where the infection is known only to have occurred between two consecutive tests.
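The censoring types described above can be illustrated with a minimal Python sketch. The function names, the five-year study length and the event times are hypothetical values chosen for illustration only; they are not taken from the study data.

```python
import math
from typing import List, Tuple


def right_censor(event_times: List[float], study_end: float) -> List[Tuple[float, int]]:
    """Return (observed_time, event_indicator) pairs.

    The indicator is 1 when the event is observed by study_end and
    0 when the subject is right censored at study_end.
    """
    observed = []
    for t in event_times:
        if t <= study_end:
            observed.append((t, 1))          # event seen during follow-up
        else:
            observed.append((study_end, 0))  # still event-free when the study ends
    return observed


def interval_censor(event_time: float, test_gap: float = 1.0) -> Tuple[float, float]:
    """Return the interval (lower, upper] between two regular tests that
    contains the unobserved event time."""
    upper = math.ceil(event_time / test_gap) * test_gap
    return (upper - test_gap, upper)


# Hypothetical times to stroke (in years) in a five-year trial.
times_to_stroke = [1.5, 6.2, 4.0, 7.8]
print(right_censor(times_to_stroke, study_end=5.0))
# [(1.5, 1), (5.0, 0), (4.0, 1), (5.0, 0)]

# Hypothetical infection at 2.3 years with annual testing.
print(interval_censor(2.3))
# (2.0, 3.0)
```

Storing each subject as an observed time plus an event indicator is the usual input layout for survival routines, which is why the sketch returns pairs rather than dropping the censored subjects.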
Another feature of survival data is “truncation”. Truncation refers to a complete lack of information about the occurrence of the event. It arises from a condition other than the event of interest that is used, for example, to screen respondents or patients (Klein and Moeschberger, 1997). There are three types of truncation: right, left and interval truncation. The most common type is left truncation, which occurs when subjects enter the study at a random age. In the case of left truncation, we have no information from the onset of risk until some time after it. In other words, the subject was not observed for some time at the beginning of the process but then came under observation.
Right truncation also occurs, but is less frequent. Klein and Moeschberger (1997) provided the example of studying the period from HIV infection until the development of AIDS. If the sample only includes those who have developed AIDS prior to the end of the study, those HIV-infected individuals who have not yet progressed to AIDS are excluded from the sample. Another type of truncation is interval or gap truncation, which is similar to left truncation. This could occur in a clinical study if, for example, a patient is under observation for the first 3 months of the study, drops out for 2 months and then rejoins the study for the last 7 months. Dropping out of the study for 2 months creates an interval or gap in the period of observation. Both left and interval truncation are dealt with during analysis by omitting the subject from all individual binary-outcome analyses during the truncation period (or gap), because the subject could not have experienced the event during those periods.
1.2 Statement of the problem
The Cox proportional hazard model is a semi-parametric model, whereas the Weibull and Gompertz models are parametric models. These models are widely used in the analysis of survival data in clinical research. It has not been established which of the three models is most efficient in the analysis of survival data. This study therefore compared the three models in order to establish the one most suitable for the analysis of survival data in clinical research.
1.3 Aim of the Study
Cancers differ, and so do the models used to study them. This study aims to compare the semi-parametric Cox proportional hazard model with the parametric Weibull and Gompertz models in order to identify the model that best fits breast cancer data in clinical research.
1.4 Objectives of the study
The following are the objectives of the study:
-To describe the survival function using Kaplan-Meier (K-M) approach, and compare the survival curves using Log-rank tests.
-To fit the three models used in the survival analysis using data on breast cancer.
-To evaluate the models used in the study of survival analysis using model comparison.
1.5 History of Cancer Disease
Human cancer is probably as old as the human race. The word cancer came from the father of medicine, Hippocrates, a Greek physician. Hippocrates used the Greek words ‘carcinos’ and ‘carcinoma’ to describe tumours, thus calling cancer “karkinos”. These Greek terms were words that describe a crab, which Hippocrates thought a tumour resembled. A tumour is an abnormal mass of new tissue growing in or on the body, causing cancer; it destroys parts of the body, often causing death. Hippocrates was certainly not the first to discover the disease, however.
The history of cancer began much earlier. Cancer is one of the ten leading causes of death in the world population. The World Health Organization (WHO)’s International Agency for Research on Cancer found that the number of cancer-related deaths increased from 7.6 million in 2008 to 8.2 million in 2012. The most commonly diagnosed types of cancer were lung cancer, breast cancer and colorectal cancer, which made up 13 percent, 11.9 percent and 9.7 percent of the total, respectively.
1.5.1 Causes of Cancer Disease
Normally, the body makes new cells when they are needed for growth or for the repair of damaged parts of the body. This cell growth is controlled by a group of genes that work together. When one or more of these genes is damaged (or mutated), too many cells are produced. These excess cells clump together to form a tumour, resulting in cancer. Genes can be mutated by harmful chemicals in a person’s environment, such as tobacco smoke, but more often damage to growth-control genes occurs spontaneously in only a few cells of the body; that is why a tumour is typically found, or at least begins, in just one spot.
Cancer is not just one disease, but a large group of almost 100 diseases. Its two main characteristics are the uncontrolled growth of cells in the human body and the ability of these cells to migrate from the original site and spread to distant sites. If the spread is not controlled, cancer can result in death.
1.5.2 Those at Risk of Cancer
Most people who develop cancer have no family history of the disease, meaning that their cancer was not inherited, but sometimes the more common cancers (for example, breast, prostate and colon cancer) can run in a family. If a close relative (for example, a grandparent, parent or sibling) has been diagnosed with a common form of cancer, you too may be at risk. Your risk increases if a relative developed the cancer at an early age, i.e. before age 50 to 55 years, or if more than one close relative developed the same cancer.
1.5.3 Types of Cancer
Cancer has the potential to affect every organ in the body. The cells within a malignant tumour have the ability to invade neighbouring tissues and organs, thus spreading the disease. It is also possible for cancerous cells to break free from the tumour and enter the bloodstream, in turn spreading the disease to other organs. This process of spreading is called metastasis.
Some of the different types of cancer are:
Blood cancer, Bone cancer, Brain cancer, Breast cancer, Digestive/Gastrointestinal cancer, Endocrine cancer, Respiratory cancer, Skin cancer, Prostate cancer, Cervical cancer
1.5.4 Ways to Reduce the Risk of Cancer
The best way to reduce cancer risk may differ slightly for each type of cancer. Here are some general guidelines to follow:
-Do not use tobacco products; avoid excessive alcohol consumption (for lung cancer)
-Perform regular self-examination.
-See your doctor for periodic screening.
-If cancer runs in your family, discuss this information with your doctor.
-Maintain a healthy body weight, avoid obesity.
-Exercise regularly.
-Eat plenty of vegetables and fruits.
-Decrease or avoid consumption of animal fats and red meat.
1.5.5 Diagnosis of Cancer
Diagnosis begins with a physical examination and a complete medical history. The doctor observes, feels and palpates (applies pressure by touch to) different parts of the body in order to identify any variations from the normal size. As part of the physical examination, the doctor will inspect the oral cavity, or the mouth. To detect thyroid cancer, the doctor will observe the front of the neck for swelling.
The doctor may conduct a thorough examination of the skin to look for sores that have been present for more than three weeks. Examination of the female pelvis is used to detect cancers of the ovaries, uterus, cervix and vagina. Breast examination includes visual observation, in which the doctor looks for any discharge, unevenness, discolouration or scaling, and palpation of both breasts to feel for masses or lumps.
For males, inspection of the rectum and the prostate is also included in the physical examination. The doctor inserts a gloved finger into the rectum and rotates it slowly to feel for any growths, tumours or other abnormalities. If the doctor detects an abnormality on physical examination, or the patient has some symptom that could be indicative of cancer, the doctor may order diagnostic tests.
A blood test for cancer is easy to perform, usually inexpensive and risk-free. Blood tests can be either specific or non-specific. In certain cancers, the cancer cells release particular proteins (called tumour markers), and blood tests can be used to detect the presence of these tumour markers. Blood tests are generally more useful in monitoring the effectiveness of treatment, or in following the course of the disease and detecting its recurrence.
1.5.6 Treatment of Cancer
Treatment and prevention of cancer continue to be the focus of a great deal of research. Research into new cancer therapies includes cancer-targeting gene therapy, virus therapy and drugs that stimulate apoptosis, or self-destruction, of cancer cells but not healthy cells. However, all of these new therapies take years of clinical testing and research.
The aim of cancer treatment is to remove all, or as much as possible, of the tumour and to prevent the recurrence or spread of the primary tumour. If the cancer is very aggressive and a cure is not possible, then the treatment should be aimed at relieving symptoms and controlling the cancer for as long as possible. Cancer treatment can take many different forms, and it is always tailored to the individual patient. The decision on which type of treatment is most appropriate depends on the type and location of the cancer, the extent to which it has already spread, and the patient’s age, sex, general health status and personal treatment preferences. The major types of treatment are:
Surgery, radiation, chemotherapy, immunotherapy, hormone therapy and bone-marrow transplantation.
Surgery: Surgery is the removal of a visible tumour and is the most frequently used cancer treatment. It is most effective when a cancer is small and confined to one area of the body.
Radiation: Radiation kills tumour cells. Radiation is used alone in cases where a tumour is unsuitable for surgery. Most often, it is used in conjunction with surgery and chemotherapy. Radiation can be either external or internal. In the external form, the radiation is aimed at the tumour from outside the body. In internal radiation (also known as brachytherapy), a radioactive substance in the form of pellets or liquid is placed at the cancerous site by means of a pill, injection or insertion in a sealed container.
Chemotherapy: This is the use of drugs to kill cancer cells. It destroys the hard-to-detect cancer cells that have spread and are circulating in the body.
Immunotherapy: This uses the body’s own immune system to destroy cancer cells. This form of treatment is being intensively studied in clinical trials and is not widely available to most cancer patients.
Hormone Therapy: Hormone therapy is a standard treatment for some types of cancer that are hormone-dependent and grow faster in the presence of a particular hormone. These include cancers of the prostate, breast and uterus. Hormone therapy involves blocking the production or action of these hormones. As a result, the growth of the tumour slows down and survival may be extended for several months or years.
Bone marrow transplantation: The bone marrow is the tissue within the bone cavities that contains blood-forming cells. Healthy bone marrow tissue replenishes the blood supply and is essential to life. Sometimes the amount of drugs or radiation needed to destroy cancer cells also destroys bone marrow. Replacing bone marrow with healthy cells counteracts this adverse effect. A bone marrow transplant is the removal of marrow from one person and the transplant of the blood-forming cells either to the same person or to someone else.
1.6 Breast Cancer
Breast cancer is a type of cancer originating from breast tissue, most commonly from the inner lining of the milk ducts or the lobules that supply the ducts with milk (Sariego, 2010). Cancers originating from ducts are known as ductal carcinomas, while those originating from lobules are known as lobular carcinomas. Breast cancer occurs in humans and other mammals. While the overwhelming majority of human cases occur in women, male breast cancer can also occur. Breast cancer is the most frequent cancer in women and the second most frequent in the population as a whole. Although its prevalence differs between societies, it is known that one out of every 8 to 10 women in Western society will develop breast cancer during her lifetime.
The mortality rate of breast cancer was calculated as 3%. The prevalence of the disease is higher in North America and European countries than in the rest of the world (Stewart and Kleihues, 2003). In 1990, the World Health Organization (WHO) conducted a study in which 796,000 breast cancer cases were reported, out of which 314,000 deaths occurred. In 2002, 1,152,000 new cases and 411,000 deaths were detected in a study conducted by the International Agency for Research on Cancer (IARC) under WHO administration. Across all stages of the disease, the five-year survival rate has been reported as 73% in developed countries and 52% in developing countries. The significant difference between these two values can be explained by early diagnosis via mammography scans and better treatment in developed countries. Breast cancer is a disease found mostly in females. It can be genetically transmitted, and it is also caused by abnormal hormone levels, obesity, high-fat foods and the spread of cancer from other organs to the breast (Wingo et al., 1995). Breast cancer is therefore considered a major health issue worldwide (Harirchi et al., 2001).
1.7 Basic Concepts in Survival Analysis
The primary concept in survival analysis is the survival time, which is also called the failure time. Survival time is the length of time measured from the time origin to the time at which the event of interest occurs. Other concepts are the survival and hazard functions.
1.7.1 Survival Function
A typical analysis of survival data involves the modelling of time-to-event data, such as time until death. The time to the event of interest is called either survival time or failure time. The survival function is a basic quantity employed to describe the probability that an individual survives beyond a specific time $t$, that is, the probability that the event of interest has not yet occurred by time $t$. The statistical expression of the survival function is given by

$$S(t) = P(T > t), \quad t \ge 0 \tag{1.1}$$

where $T$ denotes the random variable representing the survival time, which is the time until the event of interest occurs, and $t$ is any particular value of $T$. In other words, the survival function models the probability that the event of interest is experienced beyond time $t$.
Note that the survival function is defined for both discrete and continuous $T$, and the probability density function and the hazard function are easily specified in both cases.
For a discrete random variable $T$ taking values $0 \le t_1 < t_2 < \dots$, let the probability mass function be given by $P(T = t_i) = f(t_i)$, $i = 1, 2, \dots$; then the survival function is

$$S(t) = \sum_{j:\, t_j > t} f(t_j) \tag{1.2}$$

$$S(t) = \sum_{j} f(t_j)\, I(t_j > t) \tag{1.3}$$

where the indicator function is

$$I(t_j > t) = \begin{cases} 1, & \text{if } t_j > t \\ 0, & \text{if } t_j \le t \end{cases} \tag{1.4}$$

For a continuous random variable $T$ with cumulative distribution function $F(t) = 1 - S(t)$, the probability density function of $T$ is

$$f(t) = F'(t) = -S'(t), \quad t \ge 0 \tag{1.5}$$
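The discrete form of the survival function in (1.2) to (1.4) can be sketched in a few lines of Python. This is an illustrative example only: it assumes the convention $S(t) = P(T > t)$, and the support points and probabilities below are hypothetical, not values from the breast cancer data.

```python
from typing import Sequence


def survival_discrete(t: float, support: Sequence[float], pmf: Sequence[float]) -> float:
    """S(t) = sum of f(t_j) over all support points t_j > t.

    The condition `tj > t` plays the role of the indicator I(t_j > t).
    """
    return sum((f for tj, f in zip(support, pmf) if tj > t), 0.0)


# Hypothetical survival times (in years) and their probabilities.
support = [1, 2, 3, 4]
pmf = [0.125, 0.25, 0.125, 0.5]

print(survival_discrete(0, support, pmf))  # 1.0   (no event has occurred yet)
print(survival_discrete(2, support, pmf))  # 0.625 (mass at t_j = 3 and t_j = 4)
print(survival_discrete(4, support, pmf))  # 0.0   (all events have occurred)
```

The function is a right-continuous step function that starts at 1 and drops by $f(t_j)$ at each support point.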
1.7.2 Hazard Function
The hazard function gives the instantaneous failure rate at time $t$, given that the individual has survived up to time $t$. The hazard function $h(t) \ge 0$ is given as

$$h(t) = \lim_{\Delta t \to 0} \frac{P(\text{individual fails in the time interval } (t, t + \Delta t) \mid \text{it survives until time } t)}{\Delta t} \tag{1.6}$$

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}, \quad t \ge 0 \tag{1.7}$$

where $\Delta t$ is a small interval of time.
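The limit in the definition of the hazard function can be explored numerically. The sketch below is illustrative only and assumes an exponential survival time with a hypothetical rate of 0.1 per month, for which the hazard is known to be constant and equal to the rate. For continuous $T$ the conditional probability in the definition equals $(S(t) - S(t + \Delta t)) / S(t)$, which is what the code evaluates.

```python
import math
from typing import Callable


def exp_survival(t: float, rate: float) -> float:
    """Survival function of an exponential survival time: S(t) = exp(-rate * t)."""
    return math.exp(-rate * t)


def approx_hazard(t: float, dt: float, survival: Callable[[float], float]) -> float:
    """Finite-dt version of the hazard definition for continuous T:
    P(t <= T < t + dt | T >= t) / dt = (S(t) - S(t + dt)) / (S(t) * dt).
    """
    s_t = survival(t)
    return (s_t - survival(t + dt)) / (s_t * dt)


rate = 0.1  # hypothetical rate per month


def surv(u: float) -> float:
    return exp_survival(u, rate)


# The approximation approaches the true hazard (0.1) as dt shrinks.
for dt in (1.0, 0.1, 0.001):
    print(dt, approx_hazard(2.0, dt, surv))
```

Shrinking `dt` moves the printed values towards 0.1, which mirrors the limit taken as $\Delta t \to 0$.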
There is a defined relationship between $S(t)$ and $h(t)$, which is given by the formula

$$h(t) = \frac{f(t)}{S(t)} = -\frac{d \log S(t)}{dt} \tag{1.8}$$

$$S(t) = \exp\left(-\int_0^t h(u)\,du\right) = \exp(-H(t)), \quad t \ge 0 \tag{1.9}$$

where

$$H(t) = \int_0^t h(u)\,du$$

is called the cumulative hazard function, which can be obtained from the survival function since

$$H(t) = -\log S(t) \tag{1.10}$$

The probability density function of $T$ can be written as

$$f(t) = h(t) \exp\left(-\int_0^t h(u)\,du\right), \quad t \ge 0 \tag{1.11}$$
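The relationships between the hazard, cumulative hazard, survival and density functions can be checked numerically. The sketch below is a hedged illustration: it assumes the common shape/scale form of the Weibull distribution, with $h(t) = (k/\lambda)(t/\lambda)^{k-1}$ and $S(t) = \exp(-(t/\lambda)^k)$, which may differ from the parameterisation used in Chapter Three, and the parameter values are hypothetical.

```python
import math
from typing import Callable


def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard in the assumed shape/scale form."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_survival(t: float, shape: float, scale: float) -> float:
    """Closed-form Weibull survival function."""
    return math.exp(-((t / scale) ** shape))


def cumulative_hazard(t: float, hazard: Callable[[float], float], steps: int = 10_000) -> float:
    """H(t) = integral of h(u) from 0 to t, by the midpoint rule."""
    width = t / steps
    return width * sum(hazard((i + 0.5) * width) for i in range(steps))


shape, scale, t = 1.5, 10.0, 5.0  # hypothetical parameter values


def h(u: float) -> float:
    return weibull_hazard(u, shape, scale)


H_num = cumulative_hazard(t, h)
S_closed = weibull_survival(t, shape, scale)
S_from_H = math.exp(-H_num)  # survival recovered from the cumulative hazard

# Density from h(t) * exp(-H(t)), compared with a numerical -dS/dt.
eps = 1e-5
f_numeric = -(weibull_survival(t + eps, shape, scale)
              - weibull_survival(t - eps, shape, scale)) / (2 * eps)
f_from_h = h(t) * S_from_H

print(S_closed, S_from_H)   # the two survival values should agree closely
print(f_numeric, f_from_h)  # the two density values should agree closely
```

The agreement of the two pairs of values shows that any one of $h(t)$, $H(t)$, $S(t)$ and $f(t)$ determines the other three.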
The survival function is most useful for comparing the survival progress of two or more groups. The hazard function gives a more useful description of the risk of failure at any point in time.