Balkan Journal of Medical Genetics

WEB-BASED SOFTWARE FOR STORAGE, STATISTICAL PROCESSING AND ANALYSIS OF SNP DATA IN STUDIES ON COMPLEX DISORDERS
Betcheva E1, Betchev C2, Toncheva DI1,*
*Corresponding Author: Professor Draga Ivanova Toncheva, M.D., Ph.D., Department of Medical Genetics, Medical Faculty, Medical University, 2 Zdrave str., SBALAG “Maichin dom”, 6 Fl., 1431 Sofia, Bulgaria; Tel./Fax: +35-92-952-0357; E-mail: dragatoncheva@yahoo.com

METHOD AND DISCUSSION

The application we have developed is based on a three-layer architecture model (Figure 1). The client system is an arbitrary browser (Mozilla, Internet Explorer, Opera, Safari, etc.). The WEB server is Apache on Linux, running the PHP scripting language and a MySQL database. Data are transferred from the client’s computer to the host mail server as unconverted text files with a particular data structure (Figure 1).

The internet plays an important role in the development of novel systems and algorithmic models for information services applications (ISA) [14]. Common internet services are as follows: electronic mail (e-mail), file transfer (ftp), and the WEB hypertext transfer protocol for WEB-page browsing (http). Less popular but equally accessible are TELNET, an internet protocol for connecting to a remote server, and USENET, an on-line information interchange service. Connection and communication through the internet are possible regardless of differences in users’ platforms (operating systems and hardware) and software. This operational principle enables the establishment of a two-layer model for simple communication between Client and Server. A more complex three-layer model for data interchange is required when large amounts of specific information must be managed and reliably stored. It adds a Server-accessible database for management and storage of large amounts of structured information (Figure 1).

Modern information technology (IT) is defined as the purposeful use of software and technology, and the application of computer-based information and communication systems, to achieve maximum efficiency in specific task management procedures. It enables highly effective utilization of time and resources, and also reveals new opportunities for task performance. At present, use of the internet is an integral part of contemporary IT and of the virtual environment for transfer of large amounts of data. This relates to various specific activities in the field of humanities (http://mysql.com/index.html) [16-18].

In order to reduce the cost while preserving the power of the analysis, some authors have designed the whole-genome approach in a two-step manner. A pilot fraction of samples is selected for high-throughput genotyping by microarray technology. Subsequently, a number of top markers are chosen for genotyping in a second sample set [13,15]. We have adopted such an approach for our study. Following the validation and replication studies, highly efficient and reliable statistical processing of genotyping data for 100 genetic markers in 1,000 DNA samples was required. The common procedure for data processing includes interventions such as manual transfer and conversion of text files (containing unnecessary additional information) into an electronic spreadsheet (for example, Microsoft EXCEL) driven by macro commands, in order to evaluate certain quantities. Each manual procedure consumes considerable time and resources, and increases the risk of human error and of completely erroneous interpretations. The quantitative assessment of the obtained data is only an initial step, which requires further mathematical processing. A useful approach is to create a database for storage of results and to process the data afterwards, particularly since not all data analysis procedures occur simultaneously. Since the SNP-genotyping machinery is designed for robot control and does not include resources for organization, storage and further processing of data, three-layer software provides a solution. For these particular tasks, an internet connection and an installed WEB browser are quite sufficient.

In our experimental work, the SNP-genotyping detection equipment employed a 384-well plate format. For technical reasons, DNA samples from four 96-well polymerase chain reaction (PCR) plates were used to compose a 384-well plate, where test samples are analyzed along with control samples. Thus, in the resulting genotyping text file, data from samples of different groups (cases and controls) are heavily intermixed. For the analysis of 100 SNPs in 1,000 DNA samples, we needed to prepare 400 384-well plates: four plates with DNA samples from patients and healthy controls for each polymorphism. Consequently, statistical processing of 400 files with genotyping data was required.
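The plate arithmetic above can be illustrated with a minimal Python sketch. The text does not state how the four 96-well plates were arranged within the 384-well plate; the interleaved quadrant layout used here is only a common convention and is an assumption, as are the function names `source_of` and `files_required`.

```python
import string

ROWS_384 = string.ascii_uppercase[:16]  # rows A..P of a 384-well plate
ROWS_96 = string.ascii_uppercase[:8]    # rows A..H of a 96-well plate


def source_of(well_384: str) -> tuple[int, str]:
    """Map a 384-well position (e.g. 'B3') to (source plate 1-4, 96-well position).

    ASSUMPTION: interleaved quadrants, i.e. source plate 1 fills odd rows and
    odd columns, plate 2 odd rows and even columns, plate 3 even rows and odd
    columns, plate 4 even rows and even columns.
    """
    row = ROWS_384.index(well_384[0].upper())  # 0..15, ValueError if invalid
    col = int(well_384[1:]) - 1                # 0..23
    if not 0 <= col < 24:
        raise ValueError(f"column out of range in {well_384!r}")
    plate = 2 * (row % 2) + (col % 2) + 1
    return plate, f"{ROWS_96[row // 2]}{col // 2 + 1}"


def files_required(n_snps: int, plates_per_snp: int = 4) -> int:
    """One genotyping text file is produced per 384-well plate."""
    return n_snps * plates_per_snp
```

With 100 SNPs and four plates per SNP, `files_required(100)` gives the 400 files mentioned in the text.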

Initially, we developed an appropriate WEB form, enabling the client (researcher) to input the IDs of patients and control subjects, SNP IDs, the specific position of each subject in the PCR plate, and the name of the file that contains the assay information (Figures 2A and 2B). From a drop-down menu in the top toolbar, a set of markers (SNPs) can be selected for analysis (custom design). Prior to testing, markers can be assigned to sets of SNPs of interest (i.e., names of SNPs are entered into the database in advance). The SNPs are designated by their unique RefSNP (rs) code according to the NCBI dbSNP. The code comprises at least four digits, so a typing mistake can easily occur. Entering the SNPs of interest in advance enables the program to check for typing and other errors.
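The check against pre-registered markers might look like the following Python sketch. It is not the authors' PHP code; the function name `validate_snp_id` and the placeholder identifier used in the usage note are hypothetical.

```python
import re

# An rs identifier is the prefix "rs" followed by digits.
RS_PATTERN = re.compile(r"rs\d+")


def validate_snp_id(entered: str, registered: set[str]) -> str:
    """Return the normalized rs code, or raise ValueError on a typing error.

    `registered` stands for the set of SNPs of interest entered into the
    database in advance (hypothetical representation).
    """
    snp = entered.strip().lower()
    if not RS_PATTERN.fullmatch(snp):
        raise ValueError(f"{entered!r} is not a well-formed rs identifier")
    if snp not in registered:
        raise ValueError(f"{snp} is not in the pre-registered marker set")
    return snp
```

For example, with a marker set containing only the placeholder `rs1234567`, the input `rs123456` is rejected because one digit is missing, although it is a well-formed rs code.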

The plate number toolbar is custom designed (Figure 2A), and allows the user to enter, in a single step, the design of all templates (384-well plates with DNA samples) used for the SNP genotyping. In our study, four types of templates (for 1,000 DNA samples) were designed. This step allows each position in a template (i.e., a well with a DNA sample) to be recognized as a specific ID number that corresponds to a certain patient or healthy control. Thus, in the database, the genotyping data from individuals with the disease are separated from those of the unaffected subjects, and are arranged according to the list of IDs.
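A template of this kind can be thought of as a lookup from well position to subject. The Python sketch below is illustrative only; the `Subject` record, the group labels "case" and "control", and the function names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subject:
    subject_id: str
    group: str  # "case" or "control" (hypothetical labels)


def build_template(assignments):
    """Build {well: Subject} from (well, subject_id, group) triples.

    Rejects unknown groups and wells that are assigned twice.
    """
    template = {}
    for well, subject_id, group in assignments:
        if group not in ("case", "control"):
            raise ValueError(f"unknown group {group!r} for well {well}")
        if well in template:
            raise ValueError(f"well {well} is assigned twice")
        template[well] = Subject(subject_id, group)
    return template


def split_by_group(template):
    """Return (case IDs, control IDs), each sorted by ID."""
    cases = sorted(s.subject_id for s in template.values() if s.group == "case")
    controls = sorted(s.subject_id for s in template.values() if s.group == "control")
    return cases, controls
```

Because the template is entered once and reused for every SNP, the separation of cases from controls does not have to be repeated for each genotyping file.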

The BROWSE button permits the genotyping data text file to be attached. Each text file obtained from the genotyping machinery contains data from one 384-well template, integrating information on the position of up to 384 DNA samples, the alternative alleles of one SNP, the DNA quality, the detection rate quality and other parameters. After the file has been selected, the SEND button sends the text file and all descriptive data to the server via the internet. The server software processes the acquired information and applies the decoding scheme in accordance with the experimental conditions. Interpreted data are converted into a format in compliance with their preliminary properties and functions, and recorded in the database at a predefined position. In other words, the template ID, the position of the DNA sample, the genotyping data and the ID of the patient or control subject are recognized and matched, and are preserved in structured form in the database.
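The matching step can be sketched in Python as follows. The actual layout of the instrument's text file is not described in the text, so the tab-separated format with `Well` and `Call` columns is purely hypothetical, as are the function names and the record fields.

```python
import csv
import io


def parse_calls(text: str) -> dict[str, str]:
    """Return {well: genotype call} from tab-separated text.

    ASSUMPTION: the file has a header row containing 'Well' and 'Call'
    columns; any other columns (quality scores etc.) are ignored.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["Well"]: row["Call"] for row in reader}


def match_records(snp_id, plate_id, calls, template):
    """Join genotype calls with a template {well: (subject_id, group)}.

    Wells that have no call in the file are skipped.
    """
    records = []
    for well, (subject_id, group) in sorted(template.items()):
        genotype = calls.get(well)
        if genotype is None:
            continue
        records.append({
            "snp_id": snp_id,
            "plate_id": plate_id,
            "well": well,
            "subject_id": subject_id,
            "group": group,
            "genotype": genotype,
        })
    return records
```

Each resulting record carries the template ID, the well position, the genotype and the subject ID together, which is the form in which the text describes the data being stored.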

The main page of the interface allows the client to check for errors in the selection of SNP and template IDs and to obtain information on the progress of statistical processing. The statistical analysis is completed shortly afterwards. The SECOND (part)-button provides a quick link to a page presenting the detailed statistical data in a suitable table format. The MAIN (page)-button returns the client to the main page (Figure 2B).

The Clear SNP-button erases the data for a SNP from the database if necessary (in case of re-genotyping or detected errors) (Figure 2B). If new data would accidentally be recorded over existing data, the client receives an alert message and can accept or refuse the new entry. This prevents both duplicates and database disruption. Once stored in the database, the data are in a very convenient form for further processing.
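The overwrite protection and the Clear SNP operation can be sketched as follows. This is a Python illustration that uses an in-memory SQLite database as a stand-in for the MySQL database; the table name `genotype_call`, its columns and the function names are hypothetical.

```python
import sqlite3


def open_db():
    """Create an in-memory database with one call per (SNP, subject)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE genotype_call (
               snp_id     TEXT NOT NULL,
               subject_id TEXT NOT NULL,
               genotype   TEXT NOT NULL,
               PRIMARY KEY (snp_id, subject_id)
           )"""
    )
    return db


def store_call(db, snp_id, subject_id, genotype, overwrite=False):
    """Store one call. Return False, without writing, if an entry already
    exists and the client has not confirmed the overwrite."""
    exists = db.execute(
        "SELECT 1 FROM genotype_call WHERE snp_id = ? AND subject_id = ?",
        (snp_id, subject_id),
    ).fetchone()
    if exists and not overwrite:
        return False
    db.execute(
        "INSERT OR REPLACE INTO genotype_call VALUES (?, ?, ?)",
        (snp_id, subject_id, genotype),
    )
    db.commit()
    return True


def clear_snp(db, snp_id):
    """Erase all calls for one SNP and return the number of rows removed."""
    cur = db.execute("DELETE FROM genotype_call WHERE snp_id = ?", (snp_id,))
    db.commit()
    return cur.rowcount
```

The composite primary key makes duplicates impossible at the storage level, while the `overwrite` flag plays the role of the client's answer to the alert message.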

Most of the statistical parameters of the investigated markers are obtained by standard operations of the database, whereas some are estimated by specific PHP commands. For each genetic marker, the following parameters are presented in a practical format (Figure 2): i) allele and genotype frequencies in cases and healthy controls as absolute values and as percentages, in order to facilitate the comparison between the genotyping data and data from the whole genome association study and the HapMap database; ii) statistical significance of the association between allele and genotype frequencies and phenotype expression, expressed as p values computed by the two-sided Fisher’s exact test; iii) identification of the risk allele (the allele associated with increased risk for phenotype expression, i.e., the allele that is more common in case samples than in control samples); iv) odds ratio (OR) (a statistical measure of the strength of association: the odds of carrying the risk factor when the disease is present compared with the odds when it is absent) in accordance with the risk allele; v) the 95% confidence interval (95% CI).
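The quantities listed above can be illustrated with a short, dependency-free Python sketch. It is not the authors' PHP implementation. The genotype notation "A/G" and all function names are hypothetical, and the text does not say how the 95% CI was computed, so the Woolf (logit) interval used here is an assumption.

```python
import math


def allele_counts(genotypes):
    """Count alleles in genotypes written as 'A/G' (hypothetical notation)."""
    counts = {}
    for genotype in genotypes:
        for allele in genotype.split("/"):
            counts[allele] = counts.get(allele, 0) + 1
    return counts


def risk_allele(case_counts, control_counts):
    """Return the allele whose frequency is most elevated in cases."""
    n_case, n_ctrl = sum(case_counts.values()), sum(control_counts.values())

    def excess(allele):
        return case_counts.get(allele, 0) / n_case - control_counts.get(allele, 0) / n_ctrl

    return max(sorted(set(case_counts) | set(control_counts)), key=excess)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        return math.comb(r1, x) * math.comb(r2, c1 - x) / math.comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))  # tolerance for rounding ties
    return min(1.0, total)


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with an approximate 95% CI (Woolf method, an assumption).

    a, b = risk / other allele counts in cases; c, d = the same in controls.
    """
    if 0 in (a, b, c, d):
        raise ValueError("Woolf interval is undefined when a cell is zero")
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se), math.exp(log_or + z * se)
```

In this sketch, the allele counts of cases and controls form the 2x2 table passed to `fisher_exact_two_sided` and `odds_ratio_ci`, with the row of the risk allele placed first so that the odds ratio refers to it.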

The main advantage of the described product compared to the common electronic spreadsheet approach is the opportunity to establish a structured database, which may be processed further if necessary, for example for haplotype analysis or for evaluation of correlations within different subgroups of subjects according to their age, gender or applied drug therapy.
