
WEB-BASED SOFTWARE FOR STORAGE, STATISTICAL PROCESSING AND ANALYSIS OF SNP DATA IN STUDIES ON COMPLEX DISORDERS Betcheva E1, Betchev C2, Toncheva DI1,*
*Corresponding Author: Professor Draga Ivanova Toncheva, M.D., Ph.D., Department of Medical Genetics, Medical Faculty, Medical University, 2 Zdrave str., SBALAG “Maichin dom”, 6 Fl., 1431 Sofia, Bulgaria; Tel./Fax: +35-92-952-0357; E-mail: dragatoncheva@yahoo.com
METHOD AND DISCUSSION
The application we have developed is based on a three-layer architecture model (Figure 1). The client system is an arbitrary browser (Mozilla, Internet Explorer, Opera, Safari, etc.). The WEB server is Apache on Linux, running the PHP scripting language and a MySQL database. Data are transferred from the client’s computer to the host mail server as unconverted text files with a particular data structure.
The internet plays an important role in the development of novel systems and algorithmic models for information services application (ISA) [14]. Common internet services are as follows: electronic mail (e-mail), file transfer (ftp), and WEB hypertext transfer protocol for WEB-page browsing (http). Less popular but equally accessible are TELNET, an internet protocol for connecting to a remote server, and USENET, an on-line information interchange service. Connection and communication through the internet is possible regardless of differences in users’ platforms (operating systems and hardware) and software. This operational principle enables the establishment of a two-layer model for simple communication between Client and Server. A more complex three-layer model for data interchange is required when abundant and specific information management and its reliable storage are expected. It includes a Server-accessible database for management and storage of large amounts of structured information (Figure 1).
Modern information technology (IT) is defined as the purposeful use of software, technology and computer-based information and communication systems to achieve maximum efficiency in specific task management procedures. This enables highly effective utilization of time and resources, and also reveals new opportunities for task performance. At present, use of the internet is an integral part of contemporary IT and of the virtual environment for transfer of large amounts of data. This relates to various specific activities in the field of humanities (http://mysql.com/index.html) [16-18].
In order to reduce the cost while preserving the power of the analysis, some authors have designed the whole-genome approach in a two-step manner. A pilot fraction of samples is selected for high-throughput genotyping by microarray technology. Subsequently, a number of top markers are chosen for genotyping in a second sample set [13,15]. We have adopted such an approach for our study. Following the validation and replication studies, highly efficient and reliable statistical processing of genotyping data for 100 genetic markers in 1,000 DNA samples was required. The common procedure for data processing includes interventions such as manual transfer and conversion of text files (containing unnecessary additional information) into an electronic spreadsheet (for example, Microsoft EXCEL) driven by macro commands, in order to evaluate certain quantities. Each manual procedure consumes considerable time and resources, and increases the risk of human error and of completely erroneous interpretations. The quantitative assessment of the obtained data is only an initial step, which requires further mathematical processing. A useful approach is the creation of a database for storage of results, followed by data processing; moreover, not all data analysis procedures have to occur simultaneously. Since the SNP-genotyping machinery is designed for robot control and does not include resources for organization, storage and further processing of data, three-layer software provides a solution. For these particular tasks, an internet connection and an installed WEB browser are quite sufficient.
In our experimental work, the SNP-genotyping detection equipment employed a 384-well plate format. For technical reasons, DNA samples from four 96-well polymerase chain reaction (PCR) plates were combined to compose one 384-well plate, in which test samples are analyzed along with control samples. Thus, the resulting genotyping text file contains thoroughly intermixed data from samples of different groups (cases and controls). For the analysis of 100 SNPs in 1,000 DNA samples, we needed to prepare 400 plates of 384 wells, i.e., four plates with DNA samples from patients and healthy controls for each polymorphism. Consequently, statistical processing of 400 files with genotyping data was required.
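The plate arithmetic and the mixing of samples described above can be illustrated with a short Python sketch. The text does not state how the four 96-well plates were arranged on the 384-well plate, so the quadrant-interleaved layout used here is an assumption (a common convention), and the function names `well_384` and `files_required` are hypothetical.

```python
import string

ROWS_96, COLS_96 = 8, 12                 # 96-well plate: rows A-H, columns 1-12
ROWS_384 = string.ascii_uppercase[:16]   # 384-well plate: rows A-P, columns 1-24


def well_384(quadrant, well_96):
    """Map a 96-well position to a 384-well position.

    ASSUMPTION: quadrant-interleaved layout, where source plate 0 fills
    odd rows/odd columns, plate 1 odd rows/even columns, and so on.
    The layout actually used in the study is not given in the text.
    """
    if quadrant not in range(4):
        raise ValueError("quadrant must be 0, 1, 2 or 3")
    row = string.ascii_uppercase.index(well_96[0].upper())
    col = int(well_96[1:]) - 1
    if not (0 <= row < ROWS_96 and 0 <= col < COLS_96):
        raise ValueError(f"not a 96-well position: {well_96}")
    row_384 = 2 * row + quadrant // 2
    col_384 = 2 * col + quadrant % 2
    return f"{ROWS_384[row_384]}{col_384 + 1}"


def files_required(n_snps, plates_per_snp):
    """One genotyping text file is produced per plate and per SNP."""
    return n_snps * plates_per_snp


if __name__ == "__main__":
    print(well_384(1, "A1"))        # neighbouring wells come from different source plates
    print(files_required(100, 4))   # 100 SNPs x 4 plates each
```

Under this assumed layout, adjacent wells of the 384-well plate originate from different 96-well plates, which is why the raw output file cannot be read group by group without a template.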
Initially, we developed an appropriate WEB form, enabling the client (researcher) to input the IDs of patient and control subjects, the SNP IDs, the specific position of each subject in the PCR plate, and the name of the file that contains the assay information (Figures 2A and 2B). From a drop-down menu on the top toolbar, a set of markers (SNPs) can be selected for analysis (custom design). Prior to testing, markers can be assigned to sets of SNPs of interest (i.e., names of SNPs are entered into the database in advance). The SNPs are designated by their unique RefSNP (rs) code according to the NCBI dbSNP, which comprises at least four digits, so a typing mistake can easily occur. Because the SNPs of interest are inserted in advance, the program is able to control for typing and other errors.
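The control for typing errors can be sketched as follows. This is a minimal Python illustration, not the PHP code of the application; the function name `check_snp_id` and the rs codes in the usage lines are made up for the example.

```python
import re

# An rs code is the prefix "rs" followed by digits only.
RS_PATTERN = re.compile(r"rs\d+")


def check_snp_id(entered, registered):
    """Return the normalised rs code, or raise ValueError.

    `registered` stands for the set of SNPs entered into the database
    in advance (hypothetical name).
    """
    snp_id = entered.strip().lower()
    if not RS_PATTERN.fullmatch(snp_id):
        raise ValueError(f"not an rs code: {entered!r}")
    if snp_id not in registered:
        raise ValueError(f"{snp_id} is not in the registered SNP set")
    return snp_id


if __name__ == "__main__":
    registered = {"rs1234567", "rs7654321"}   # placeholder codes, not study markers
    print(check_snp_id(" RS1234567 ", registered))
    try:
        check_snp_id("rs1234576", registered)  # transposed digits are rejected
    except ValueError as err:
        print(err)
```

Checking against a pre-registered set catches well-formed but mistyped codes, which a format check alone would accept.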
The plate’s number toolbar is custom designed (Figure 2A), and allows the user to insert in a single step the design of all templates (384-well plates with DNA samples) used for the SNP genotyping. In our study, four types of templates (for 1,000 DNA samples) were designed. This step allows each position in a template (i.e., well with a DNA sample) to be recognized as a specific ID number that corresponds to a certain patient or healthy control. Thus, in the database, the genotyping data from individuals with the disease will be separated from the unaffected subjects, and will be arranged according to the list of IDs.
The BROWSE button permits the genotyping data text file to be attached. Each text file obtained from the genotyping machinery contains data from one 384-well template, integrating information on the position of up to 384 DNA samples, the alternating alleles of one SNP, the DNA quality, the detection rate quality and other parameters. After the file has been selected, the SEND button sends the text file and all descriptive data to the server via the internet. Server software processes the acquired information and applies the decoding scheme in accordance with the experimental conditions. Interpreted data are converted into a format in compliance with their preliminary properties and functions, and recorded in the database at a predefined position. In other words, the template ID, the position of the DNA sample, the genotyping data and the ID of the patient or the control subject are recognized and matched, and can be preserved in structured form in a database.
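The matching step performed on the server can be sketched in Python as follows. The export format of the genotyping instrument is not described in the text, so the tab-separated layout with `Well`, `Call` and `Quality` columns and `#` comment lines is purely an assumption, as are the function name `decode_plate` and the subject IDs.

```python
import csv


def decode_plate(text, template):
    """Match each genotype call to the subject assigned to its well.

    text     -- content of one genotyping file (ASSUMED tab-separated,
                header 'Well', 'Call', 'Quality', '#' lines ignored)
    template -- dict mapping well position -> (subject_id, group)
    """
    data_lines = [
        line for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    records = []
    for row in csv.DictReader(data_lines, delimiter="\t"):
        well = row["Well"].strip().upper()
        if well not in template:
            continue  # well not assigned to any subject in this template
        subject_id, group = template[well]
        records.append({
            "subject_id": subject_id,
            "group": group,
            "well": well,
            "genotype": row["Call"].strip(),
            "quality": float(row["Quality"]),
        })
    return records


if __name__ == "__main__":
    text = (
        "# instrument export (hypothetical)\n"
        "Well\tCall\tQuality\n"
        "A1\tA/G\t0.98\n"
        "A2\tG/G\t0.91\n"
    )
    template = {"A1": ("P001", "case"), "A2": ("C001", "control")}
    for record in decode_plate(text, template):
        print(record)
```

Each returned record carries the subject ID and group together with the genotype, so cases and controls can be separated when the records are written to the database.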
The main page of the interface allows the client to check for errors in the selection of SNP and template IDs and to obtain information on the progress of statistical processing. Within a short time, the statistical analysis is completed. The SECOND (part) button provides a quick link to a page with precise statistical data in a suitable table format. The MAIN (page) button returns the client to the main page (Figure 2B).
The Clear SNP button enables SNP data to be erased from the database, if necessary (in case of re-genotyping or detected errors) (Figure 2B). If new data are about to be recorded over existing data by accident, the client receives an alert message and can accept or refuse the new entry. This prevents both duplicates and database disruption. Once stored in the database, the data are in a very convenient form for further processing.
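The overwrite protection and the Clear SNP operation can be sketched as follows. The application itself uses MySQL and PHP; this Python sketch substitutes an in-memory SQLite database so that it is self-contained, and the table layout and function names are assumptions made for illustration.

```python
import sqlite3


def open_db():
    """Create an in-memory database with one genotype per SNP and subject."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE genotype (
               snp_id     TEXT NOT NULL,
               subject_id TEXT NOT NULL,
               genotype   TEXT NOT NULL,
               PRIMARY KEY (snp_id, subject_id)
           )"""
    )
    return conn


def store(conn, snp_id, subject_id, genotype, overwrite=False):
    """Store a genotype; return False if an entry exists and overwrite is not confirmed."""
    exists = conn.execute(
        "SELECT 1 FROM genotype WHERE snp_id = ? AND subject_id = ?",
        (snp_id, subject_id),
    ).fetchone()
    if exists and not overwrite:
        return False  # the caller would show the accept/refuse alert here
    conn.execute(
        "INSERT OR REPLACE INTO genotype VALUES (?, ?, ?)",
        (snp_id, subject_id, genotype),
    )
    conn.commit()
    return True


def clear_snp(conn, snp_id):
    """Erase all data for one SNP; return the number of rows removed."""
    cursor = conn.execute("DELETE FROM genotype WHERE snp_id = ?", (snp_id,))
    conn.commit()
    return cursor.rowcount


if __name__ == "__main__":
    conn = open_db()
    print(store(conn, "rs1234567", "P001", "A/G"))                  # first entry
    print(store(conn, "rs1234567", "P001", "G/G"))                  # refused
    print(store(conn, "rs1234567", "P001", "G/G", overwrite=True))  # confirmed
    print(clear_snp(conn, "rs1234567"))
```

The primary key on SNP and subject guarantees that duplicates cannot be stored even if the explicit check were bypassed.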
Most of the statistical parameters of the investigated markers are obtained by standard operations of the database, whereas some are estimated by specific PHP commands. For each genetic marker, the following parameters are presented in a practical format (Figure 2): i) allele and genotype frequencies in cases and healthy controls as absolute values and as percentages, in order to facilitate the comparison between the genotyping data and data from the whole genome association study and the HapMap database; ii) statistical significance of the association between allele and genotype frequencies and phenotype expression, expressed as p values computed by the two-sided Fisher’s exact test; iii) identification of the risk allele (the allele associated with increased risk for phenotype expression, i.e., the allele that is more common in case samples than in control samples); iv) odds ratio (OR) (a statistical measure of the strength of association: the odds of carrying the risk factor when the disease is present compared with the odds when it is absent) in accordance with the risk allele; v) the 95% confidence interval (95% CI).
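The allelic part of these calculations can be sketched in Python as follows. This is an illustration, not the PHP code of the application. The text does not state how the confidence interval is computed, so the logarithmic (Woolf) method used here is an assumption, and the genotype format `A/G` and all function names are hypothetical.

```python
import math
from collections import Counter


def allele_counts(genotypes):
    """Count alleles in genotypes written as 'A/G' (assumed format)."""
    return Counter(allele for g in genotypes for allele in g.split("/"))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_obs = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as probable as the observed one.
    total = sum(p for p in map(prob, range(low, high + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a 95% CI (ASSUMED Woolf log method; needs non-zero cells)."""
    if 0 in (a, b, c, d):
        raise ValueError("all four cells must be non-zero")
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se), math.exp(log_or + z * se)


def allele_association(case_genotypes, control_genotypes):
    """Risk allele, p value, odds ratio and 95% CI for a biallelic marker."""
    cases, controls = allele_counts(case_genotypes), allele_counts(control_genotypes)
    alleles = sorted(set(cases) | set(controls))
    if len(alleles) != 2:
        raise ValueError("expected exactly two alleles")

    def freq(counts, allele):
        return counts[allele] / sum(counts.values())

    # Risk allele: more common in cases than in controls.
    risk = max(alleles, key=lambda al: freq(cases, al) - freq(controls, al))
    other = alleles[0] if risk == alleles[1] else alleles[1]
    a, b, c, d = cases[risk], cases[other], controls[risk], controls[other]
    odds_ratio, ci_low, ci_high = odds_ratio_ci(a, b, c, d)
    return {
        "risk_allele": risk,
        "case_freq": freq(cases, risk),
        "control_freq": freq(controls, risk),
        "p_value": fisher_exact_two_sided(a, b, c, d),
        "odds_ratio": odds_ratio,
        "ci95": (ci_low, ci_high),
    }


if __name__ == "__main__":
    # Toy data, not study results.
    print(allele_association(["A/A", "A/G"], ["G/G", "A/G"]))
```

The exact test is computed with integer binomial coefficients, which avoids overflow for allele counts of the size handled in a study of about 1,000 samples.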
The main advantage of the described product compared to the common electronic spreadsheet approach is the opportunity to establish a structured database, which may be processed further if necessary, for example, for haplotype analysis or for evaluation of correlations within different subgroups of subjects according to their age, gender or drug therapy applied.