Paper-4 Add-In Macros for Privacy-Preserving Distributed Logrank Test Computation

    International Journal of Computational Intelligence and Information Security, June 2013, Vol. 4, No. 6, ISSN: 1837-7823

    * This paper was supported by NSF CNS-0845149 and CCF-0915374. Part of the results were presented at [UNESST 2012].

    Add-in Macros for Privacy-preserving Distributed Logrank Test Computation*

    Yu Li¹ and Sheng Zhong¹

    ¹Department of Computer Science and Engineering, the State University of New York at Buffalo, Buffalo, NY 14260, USA

    Abstract

    Survival analysis is frequently used to study survival outcomes in biological organisms. However, comparing survival curves step by step is a tedious process. In this study, we designed and developed a user-friendly, cloud-based, privacy-preserving program for Microsoft Excel, named Scorpio, that supports collaborative analysis of electronic health care data using a privacy-preserving logrank test model.

    Keywords: Survival curves, Logrank test, Privacy preserving, Excel

    1. Introduction

    People are increasingly concerned about privacy as information technology develops. Hospitals store patients' medical records electronically so that biomedical scientists can use them for research. These records include the patients' medical history, such as laboratory test results and prescribed medications. To prevent leaks of personal electronic health records, the federal Health Insurance Portability and Accountability Act (HIPAA) sets a national standard for protecting the privacy of this kind of information. With the rapid growth of medical research in recent years, biomedical scientists have proposed using these electronic medical data for collaborative research. However, privacy and security remain the main concerns impeding such collaboration. For this reason, as information and cryptographic technology develops, there is a trend toward using computational methods and programs that let medical scientists work together without revealing patients' information to others.

    Survival analysis, also called time-to-event analysis, is useful for studying many kinds of events, such as disease onset, earthquakes, and stock market crashes [1]. It makes predictions from a set of individuals who are observed at a specific starting point and then monitored continuously over fixed intervals of time. Building the survival analysis model is therefore the most critical step in obtaining a good prediction. In the biomedical field, survival analysis mainly means observing the time to death of experimental subjects. The more experimental data are available for building the model, the more precise the model can be. Biomedical researchers therefore want to combine data from different institutes to build better survival analysis models, especially models for comparing survival functions [2]. To address the privacy and security issues, computer scientists can use privacy-preserving methods that keep the data from being revealed to anyone. To compare survival curves without revealing the data, [2] proposed a privacy-preserving model that protects data privacy.

    However, comparing survival curves step by step is a tedious process. In the medical field, Microsoft Excel is widely used because of its friendly user interface and ease of operation. Other statistical packages such as SAS and SPSS have strong data management capabilities, but they are complicated to use for medical staff without professional training. Microsoft Excel is widely used in medical institutes, both for storing experimental data and for creating survival curves, and it helps medical scientists analyse data and make better decisions. In addition, Excel can be controlled by programs written in VBA (Visual Basic for Applications) or recorded as macros. Most biomedical scientists are therefore willing to use Microsoft Excel to store their experimental data, and many scientists have developed programs that run directly and automatically within Microsoft Excel. In [3], Sato et al. presented a package of macro programs (named PK MOMENT) that automatically calculates non-compartmental pharmacokinetic parameters on Microsoft Excel spreadsheets. In [4], Zhang et al. presented PKSolver, a freely available menu-driven add-in program for Microsoft Excel written in Visual Basic for Applications (VBA), for solving basic problems in pharmacokinetic (PK) and pharmacodynamic (PD) data analysis. In [5], Brown presented a simple, easily understood methodology for solving biologically based models using a Microsoft Excel spreadsheet. In [6], a user-friendly, inexpensive Excel-based program for finding potential phosphorylation sites in proteins is presented.

    In this paper, we develop a user-friendly, cloud-based, privacy-preserving program for Microsoft Excel, named Scorpio, for collaborative analysis of electronic health care data using a privacy-preserving logrank test model. The program does not require any programming skills or any use of VBA or the macro language: once the data from all institutes are ready, the program runs automatically. In the rest of this paper, we describe the method for the privacy-preserving comparison of survival curves, in particular how the data are stored and collected, as well as the design and implementation of our program.

    2. Methods

    The logrank test is a standard test for comparing survival curves. When a research institute wants to initiate a logrank test computation, it needs to collect data from different medical institutes. However, some medical data are very sensitive, so computing the logrank test without revealing these data to anyone other than their owner is a major issue. In [2], the authors proposed a privacy-preserving secure sum method that generates an initial random number and adds it to the first medical institute's data. We introduce their method briefly here. They suppose there are n groups of individuals.

    Table 1: Summary of Denotations for Logrank Test

    n_kj: the number of individuals that are alive in group k at the beginning of time interval j.

    d_kj: the number of events occurring in group k in interval j.

    O_k: the number of observed deaths in group k.

    E_k: the expected number of deaths in group k.
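    The equations themselves are not reproduced in this text. A common form of the logrank computation that is consistent with the denotations of Table 1 and with the steps described in this section is sketched below. The pooled counts d_j and n_j are notation introduced here for illustration, and the exact form used in [2] may differ.

```latex
\begin{align*}
  n_j &= \sum_{k=1}^{n} n_{kj}, &
  d_j &= \sum_{k=1}^{n} d_{kj}, \\
  O_k &= \sum_{j} d_{kj}, &
  E_k &= \sum_{j} n_{kj}\,\frac{d_j}{n_j}, \\
  Z   &= \sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k}.
\end{align*}
```

    In this form only the pooled counts n_j and d_j require data from every group, which is why the secure sum is applied to them.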

    The final Z is the logrank test result. A smaller Z indicates a higher probability that the hypothesis (that the survival curves do not differ) is true. In [2], the authors assume there are s parties (s > 3) involved in the logrank test computation. They provide a privacy-preserving method in which the first institute participating in the survival analysis computation adds a random number to its data. The range of the random number should be the same as that of the counts n_kj and d_kj (the numbers alive and the numbers of events). The first institute then passes the masked data to the next participant. Similarly, every other participant adds its local value to the sums that it receives and sends the new sums to the next party. Finally, the first institute obtains the sum and calculates the logrank test by removing the random number, which only it knows. In this process, the actual values of n_kj and d_kj are hidden behind the random numbers [2].
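    The message flow of this secure sum can be illustrated with a short simulation. The following Python sketch is not the authors' implementation: the function name secure_sum, the offset range max_offset, and the sample counts are assumptions made for illustration only.

```python
import random


def secure_sum(local_values, max_offset=1000, rng=None):
    """Simulate the ring-style secure sum for one shared quantity.

    local_values[0] belongs to the initiating institute; the remaining
    entries belong to the other parties in the order they are visited.
    Returns the true total and the running sums passed between parties.
    """
    rng = rng or random.Random()
    # Party 1 masks its own value with a random offset only it knows.
    offset = rng.randint(1, max_offset)
    running = local_values[0] + offset
    passed_along = [running]
    # Every other party adds its local value and forwards the new sum.
    for value in local_values[1:]:
        running += value
        passed_along.append(running)
    # The sum returns to party 1, which removes its offset.
    total = running - offset
    return total, passed_along


if __name__ == "__main__":
    deaths_per_institute = [4, 7, 2, 5]  # made-up counts for one interval
    total, seen = secure_sum(deaths_per_institute)
    print("pooled total:", total)
    print("masked values seen in transit:", seen)
```

    Each value in the second return value is what the next party receives. None of them equals a single institute's raw count, because the offset is never removed before the sum returns to the initiator.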

    Based on this privacy-preserving model, we designed a program that automatically collects data from each participating medical institute and adds these data to the initial file. After collecting all the data, the program calculates, for each interval, the quotient of the number of events divided by the number of individuals alive. Each medical institute then receives this value automatically and uses it to calculate its expected number of deaths and its logrank test statistic. The program then repeats the secure sum: it adds another random number to the first medical institute's logrank test result and adds up the results of all institutes. The first institute, which initiated the comparison, then obtains the final logrank test statistic and informs all other participants.
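    The arithmetic behind these steps can be sketched in Python as follows. This is a minimal illustration that assumes the common chi-square-style form of the logrank statistic, the sum over groups of (O_k - E_k)^2 / E_k. The function name and the sample counts are hypothetical, and the sketch computes everything in one place instead of distributing the steps over institutes.

```python
def logrank_statistic(alive, events):
    """Compute observed deaths, expected deaths, and the statistic Z.

    alive[k][j]  = number alive in group k at the start of interval j
    events[k][j] = number of events in group k during interval j
    """
    groups = range(len(alive))
    intervals = range(len(alive[0]))
    # Pooled counts per interval (obtained via the secure sum in the protocol).
    n_j = [sum(alive[k][j] for k in groups) for j in intervals]
    d_j = [sum(events[k][j] for k in groups) for j in intervals]
    # Quotient of pooled events over pooled individuals alive, per interval.
    ratio = [d_j[j] / n_j[j] if n_j[j] else 0.0 for j in intervals]
    # Each group can compute the next two quantities locally from the ratio.
    observed = [sum(events[k]) for k in groups]
    expected = [sum(alive[k][j] * ratio[j] for j in intervals) for k in groups]
    # Assumes every group has a non-zero expected number of deaths.
    z = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in groups)
    return observed, expected, z


if __name__ == "__main__":
    alive = [[10, 8], [10, 9]]   # made-up data: two groups, two intervals
    events = [[2, 1], [1, 3]]
    observed, expected, z = logrank_statistic(alive, events)
    print(observed, expected, z)
```

    The expected deaths always sum to the total number of observed deaths, which gives a quick consistency check on the pooled data.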

    Specifically, we use cloud-based storage to collect the data from each institute. Cloud-based storage lets everyone who has permission reach the file from anywhere. As shown in Figure 1, party 1 first adds a random number to its data and uploads the file to the server; party 2 then downloads the file, adds its own data to the existing data, and uploads the file to the server again. This continues until the last party is done. The first party can then obtain the sum of the actual data by subtracting the random number. The program then automatically calls the Microsoft Excel macros we developed to calculate the values we need. Finally, party 1 obtains the final logrank test statistic and informs the other participating institutes.

    3. Program Description

    3.1 Software Design

    The program is developed in C# combined with Microsoft Excel VBA, which is universally available and very convenient in biomedical research institutes. We assume every medical institute uses Microsoft Excel to store its survival data. To protect the privacy of these survival data, our program adds a random number to the original data of the first institute. The first institute then issues the request to compute the logrank test for comparing survival curves. Our program automatically uploads the file to the server and adds the other institutes' data to the existing data, so the participating institutes do not learn each other's survival data. Although this can be done manually, calculating the values in Excel by hand is tedious and time-consuming. Our program, in contrast, reads the input file and calculates the logrank survival comparison automatically without revealing data to others.

    Figure 1: The flow chart of our program

    3.2 How to use Scorpio

    After one institute sets up a server to store the file, each institute that wants to participate in the logrank test calculation runs the program we developed, as shown in Figure 2. First, every institute connects to the server. Then the medical institute that wants to initiate the calculation uploads its file, in which a random number has been added to the data, chooses the participants, and clicks the send button. Each participant then receives a message in turn. On each participant's side, the program downloads the file, adds the participant's data to the data already in the file, and uploads it again. After all participants have added their data, the first party obtains the total, still masked by the random number it added.

    Figure 2: The program user interface for privacy preserving logrank test

    3.3 Computation of the survival curve comparison using the logrank test

    To compute the survival curve comparison using the logrank test, the program first collects the data from all medical institutes. It then subtracts the random number that was added to the original data of the first institute and obtains the total numbers of individuals alive and of deaths in every interval. The program then calculates the sums of all alive and death counts, respectively.
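    The per-interval masking, accumulation, and unmasking can be sketched in Python as follows. The helper names and sample counts are hypothetical; the offset range of 1 to 20 mirrors the RANDBETWEEN(1,20) call in the AddRandomNumber macro listed in Section 3.4, which applies one offset to the whole column.

```python
import random


def mask_counts(counts, low=1, high=20, rng=None):
    """Add one random offset to every interval count of the first institute."""
    rng = rng or random.Random()
    offset = rng.randint(low, high)
    return [c + offset for c in counts], offset


def add_counts(running, local):
    """A participant adds its own interval counts to the running totals."""
    return [r + c for r, c in zip(running, local)]


def unmask_counts(masked_totals, offset):
    """The first institute removes its offset to recover the pooled counts."""
    return [m - offset for m in masked_totals]


if __name__ == "__main__":
    # Made-up death counts for three institutes over three intervals.
    d1, d2, d3 = [2, 1, 0], [1, 3, 1], [0, 2, 2]
    running, offset = mask_counts(d1)
    for local in (d2, d3):
        running = add_counts(running, local)
    pooled = unmask_counts(running, offset)
    print("pooled deaths per interval:", pooled)
    print("total deaths:", sum(pooled))
```

    The same three helpers apply unchanged to the alive counts, using a second, independent offset.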

    3.4 Program Code

    Here we list some of the Excel macros we developed for our program.

    Add Random Number

    Sub AddRandomNumber()

    Range("E1").Select

    ActiveCell.FormulaR1C1 = "=RANDBETWEEN(1,20)"

    Range("F1").Select

    ActiveCell.FormulaR1C1 = "Random Number for d"

    Range("E1").Select

    Selection.Copy

    Range("F2:F12").Select

    Selection.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks:=False, Transpose:=False

    Range("E1").Select

    Application.CutCopyMode = False

    ActiveCell.FormulaR1C1 = ""

    Range("E1").Select

    ActiveCell.FormulaR1C1 = "dj+RND"

    Range("E2").Select

    ActiveCell.FormulaR1C1 = "=RC[-3]+RC[1]"

    Range("E2").Select

    Selection.AutoFill Destination:=Range("E2:E12"), Type:=xlFillDefault

    Range("E2:E12").Select

    Range("H1").Select

    ActiveCell.FormulaR1C1 = "=RANDBETWEEN(1,20)"

    Range("I1").Select

    ActiveCell.FormulaR1C1 = "Random Number for n"

    Range("H1").Select

    Selection.Copy

    Range("I2:I12").Select

    Selection.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks:=False, Transpose:=False

    Range("H2").Select

    Application.CutCopyMode = False

    ActiveCell.FormulaR1C1 = ""

    Range("H1").Select

    ActiveCell.FormulaR1C1 = "nj+RNN"

    Range("H2").Select

    ActiveCell.FormulaR1C1 = "=RC[-5]+RC[1]"

    Selection.AutoFill Destination:=Range("H2:H12"), Type:=xlFillDefault

    End Sub

    Compute Ek

    Sub ComputeE()

    Range("M1").Select

    Application.CutCopyMode = False

    ActiveCell.FormulaR1C1 = "E"

    Range("M2").Select

    ' Column M = column C (RC[-10]) times column K (RC[-2]) on the same row.

    ActiveCell.FormulaR1C1 = "=RC[-10]*RC[-2]"

    Range("M2").Select

    Selection.AutoFill Destination:=Range("M2:M12"), Type:=xlFillDefault

    End Sub

    4. Samples of Program Runs

    Medical scientists usually prefer to use Microsoft Excel to store the data obtained from experiments. They also care about privacy when they want to combine data from different medical institutes for research. The Scorpio program is specially designed for medical scientists to combine their survival data and compare survival curves using the logrank test. The input data are shown in Figure 3. The medical scientists only need to enter the numbers alive and the numbers of deaths for each time interval. After the program collects all required data from the other institutes, the first party uses the macros we provide to obtain the final logrank test statistic, as shown in Figure 4.

    Figure 3: Original data owned by each institute, which should be kept confidential from other parties

    Figure 4: The final result of privacy-preserving logrank test statistic

    5. Hardware and software specifications

    An Intel Core i5 computer (2 GB RAM) running the Windows 7 operating system was used. The program was developed using the Microsoft Excel macro language on the Excel 2010 platform.

    6. Conclusion

    In this paper, we have designed a Microsoft Excel macro-based privacy-preserving program for comparing survival curves using the logrank test. To make it easy to use, the program runs directly within Microsoft Excel, which is widely used by clinics and biomedical scientists. The program protects the privacy of the data by adding random numbers to the original data. Experiments on real medical data have shown the effectiveness of the proposed program.

    References

    [1] Allison, P.D. (2010) Survival analysis using SAS: A practical guide. SAS Publishing.

    [2] Chen, T. and Zhong, S. (2011) Privacy-preserving models for comparing survival curves using the logrank test. Computer Methods and Programs in Biomedicine.

    [3] Sato, H., Sato, S., Wang, Y.M. and Horikoshi, I. (1996) Add-in macros for rapid and versatile calculation of non-compartmental pharmacokinetic parameters on Microsoft Excel spreadsheets. Computer Methods and Programs in Biomedicine, 50(1), 43-52.

    [4] Zhang, Y., Huo, M., Zhou, J. and Xie, S. (2010) PKSolver: An add-in program for pharmacokinetic and pharmacodynamic data analysis in Microsoft Excel. Computer Methods and Programs in Biomedicine, 99(3), 306-314.

    [5] Brown, M. (1999) A methodology for simulating biological systems using Microsoft Excel. Computer Methods and Programs in Biomedicine, 58(2), 181-190.

    [6] Wera, S. (1998) An EXCEL-based method to search for potential Ser/Thr-phosphorylation sites in proteins. Computer Methods and Programs in Biomedicine, 58(1), 65-68.

    [7] Li, Y. and Zhong, S. (2012) Scorpio: A simple, convenient, Microsoft Excel macro based program for privacy-preserving logrank test. Computer Applications for Database, Education, and Ubiquitous Computing, 86-91.