Overall Structure Design
To handle all of this data, we need a dedicated structure to hold the information, that is, the DNA sequences. For images, instead of only storing them, we designed an algorithm that extracts their features and stores those features in specific DNA sequences. This benefits diagnosis by quickly providing the doctor with images that have similar features. Similar features often indicate similar symptoms, which suggests that the two patients may be treated in a similar way.
In general, we put all the data in two separate databases. The main database contains the medical information itself, and the second contains feature tags for all the images. Both databases consist of DNA chains carrying information. Each chain is composed of three parts: an independent primer, a payload, and a shared primer.
In the main database, independent primers identify each patient: all chains containing a given patient's information use the same primer on one end. The payload is the main part of the chain and contains part of a patient's medical information. Because DNA synthesis limits sequence length, some information needs several chains to encode. We designed algorithms to ensure that the information can still be extracted and recovered even if a few chains are lost. The shared primer sits on the other end of the chain; information of the same kind uses the same shared primer.
The feature database is slightly different. The independent primers of its chains are feature primers produced by the feature extracting algorithm, which we introduce later. The payloads are image IDs, which can be used to look up a specific image in the main database. The shared primer is unchanged in the feature database.
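The three-part chain layout can be sketched as a small data structure. This is only an illustration under our own assumptions: the names Chain, split_payload and build_chains are hypothetical, the primers in the usage note are toy strings, and the redundancy that lets lost chains be recovered is not modelled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chain:
    """One DNA chain: independent primer + payload + shared primer."""
    independent_primer: str  # patient primer (main DB) or feature primer (feature DB)
    payload: str             # part of the encoded data, or an image ID in the feature DB
    shared_primer: str       # same for all chains carrying the same kind of information

    def sequence(self) -> str:
        # Full strand as it would be synthesized, read from one end to the other.
        return self.independent_primer + self.payload + self.shared_primer


def split_payload(data: str, max_len: int) -> List[str]:
    """Cut encoded data into pieces no longer than max_len bases."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [data[i:i + max_len] for i in range(0, len(data), max_len)]


def build_chains(independent_primer: str, shared_primer: str,
                 data: str, max_len: int) -> List[Chain]:
    """Wrap every payload piece with the same pair of primers."""
    return [Chain(independent_primer, piece, shared_primer)
            for piece in split_payload(data, max_len)]
```

For example, build_chains("ACGT", "TTGG", "AACCGGTTAA", 4) yields three chains whose payloads are "AACC", "GGTT" and "AA", all sharing the same two primers.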
Feature Extracting Algorithm
System Design
We combine artificial neural networks with DNA storage. For the medical image data set, the features of the images are extracted by a variational autoencoder (VAE), a powerful machine learning tool for projecting high-dimensional input into a lower-dimensional feature space, and the primers are synthesized according to the obtained features. With this coding method, the more similar the original images are, the more similar the generated sequences are. Primers designed in this way are also the basis of our data retrieval and of many other functions in our project.
The Work Flow
The medical image is taken as the input of a neural network, and a set of feature vectors is obtained at the output. Then, through another neural network, we encode these feature vectors into DNA sequences that satisfy the required biological constraints. Each resulting sequence is used as the primer for the corresponding chains in our downstream DNA database. Finding the best-matching DNA is a chemical reaction that acts on all chains at once, whereas a digital search for the best match has to compare candidates one by one, so this procedure shows great promise once the database becomes extremely large.
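To make the idea of turning a feature vector into a primer concrete, here is a minimal sketch. It is not the second neural network described above: as a stand-in, it uses a fixed quantizer that maps each feature value to one of four bases, and it ignores biological constraints such as GC content. The function names and the assumed value range of -1 to 1 are hypothetical.

```python
BASES = "ACGT"


def quantize_to_primer(features, low=-1.0, high=1.0):
    """Map each feature value to one base by splitting [low, high] into 4 bins.

    Stand-in for the learned encoder: nearby feature vectors fall into the
    same bins in most positions, so their primers differ in few bases.
    """
    span = high - low
    primer = []
    for x in features:
        clipped = min(max(x, low), high)                 # keep the value in range
        level = min(int((clipped - low) / span * 4), 3)  # bin index 0..3
        primer.append(BASES[level])
    return "".join(primer)


def hamming(a, b):
    """Number of positions at which two equal-length primers differ."""
    if len(a) != len(b):
        raise ValueError("primers must have the same length")
    return sum(x != y for x, y in zip(a, b))
```

With this toy mapping, two feature vectors that differ slightly in a single component produce primers that differ in at most one base, while very different vectors produce primers that differ in many positions.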
Query Mechanism
For a query image, we use the same two neural networks to encode its primer. Next, PCR is performed on the DNA database using this primer sequence. Because PCR also amplifies non-specifically, it retrieves a group of images that are highly similar to the query image. The technical details are covered in another module.
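The query step can be imitated in software as a rough mental model. The sketch below assumes that non-specific amplification behaves like a mismatch tolerance: any chain whose feature primer differs from the query primer in at most a few bases is "amplified". Real primer binding depends on thermodynamics rather than a simple mismatch count, and the names simulated_pcr and max_mismatches are hypothetical.

```python
def mismatches(a, b):
    """Count differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def simulated_pcr(database, query_primer, max_mismatches=2):
    """Return image IDs whose feature primer is close to the query primer.

    database is a list of (feature_primer, image_id) pairs, mirroring the
    feature database in which the payload of each chain is an image ID.
    Results are ordered from the closest primer to the most distant one.
    """
    hits = []
    for feature_primer, image_id in database:
        if len(feature_primer) != len(query_primer):
            continue  # this toy model only compares primers of equal length
        distance = mismatches(feature_primer, query_primer)
        if distance <= max_mismatches:
            hits.append((distance, image_id))
    hits.sort()
    return [image_id for _, image_id in hits]
```

Note that this software loop visits every chain in turn, which is exactly the cost the chemical search avoids; the sketch only shows which chains would be selected.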
Data Encoding Algorithm
Images
Images are among the most important parts of medical data. Many examinations are based on graphical results, for example CT, X-ray, and MR. Handling these images can be quite a challenge: compared with other types of data they have larger file sizes, which means more sequences are needed to store them.
However, some images are still useful for diagnosis even when they are not very sharp. Building on the existing progressive JPEG method, which encodes an image step by step and divides it into several parts, we designed an algorithm for image pre-processing. We divide the image into several parts: one fundamental part and several additional parts. The fundamental part gives a full image at fairly low quality. The additional parts are then added to the fundamental part, progressively forming images of better quality until the raw image is recovered.
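The split into one fundamental part and several additional parts can be illustrated with a much simpler scheme than progressive JPEG. The sketch below is an assumption for illustration only: it works directly on 8-bit grayscale pixel values and splits their bits into layers, whereas progressive JPEG operates on frequency coefficients. The function names and the 4-2-2 bit split are hypothetical.

```python
def split_progressive(pixels, bits_per_part=(4, 2, 2)):
    """Split 8-bit pixel values into layers, most significant bits first.

    The first layer is the fundamental part (a coarse but complete image);
    the remaining layers are additional parts that refine it.
    """
    if sum(bits_per_part) != 8:
        raise ValueError("bit widths must add up to 8")
    parts = []
    shift = 8
    for width in bits_per_part:
        shift -= width
        mask = (1 << width) - 1
        parts.append([(p >> shift) & mask for p in pixels])
    return parts


def reconstruct(parts, bits_per_part=(4, 2, 2)):
    """Rebuild the image from the first len(parts) layers.

    Passing only the fundamental part gives a low-quality image; passing
    all parts restores the original pixel values exactly.
    """
    shift = 8
    image = [0] * len(parts[0])
    for part, width in zip(parts, bits_per_part):
        shift -= width
        image = [value | (chunk << shift) for value, chunk in zip(image, part)]
    return image
```

If the chains holding an additional part are lost or not retrieved, reconstruct still returns a usable, coarser image from the parts that are available.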
Text
A medical record is a systematic record of the occurrence, development, diagnosis, and treatment of a patient's disease. It not only faithfully reflects the patient's condition but also provides extremely valuable basic data for medical treatment, scientific research, and teaching. The volume of medical data is currently growing explosively. As a very important part of medical data, written medical records have the characteristics of cold data and long-term application value. Physical storage such as DNA storage therefore has unique advantages for them.
There are many precedents for storing text in DNA. Considering the particular nature of written medical records, we designed a simple keyword coding method and adopted a greedy algorithm, Huffman coding, fountain codes, and other techniques. We also set up a small medical record database for coding practice.
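As an illustration of one of the named techniques, the sketch below applies Huffman coding to whitespace-separated keywords and then writes the resulting bits as bases, two bits per base. How the keyword coding, greedy algorithm, Huffman coding and fountain codes are actually combined is not specified here, so the tokenization, the 2-bit base mapping and all function names are assumptions; no fountain code is included.

```python
import heapq
from collections import Counter

BIT_PAIRS = {"00": "A", "01": "C", "10": "G", "11": "T"}  # assumed mapping


def huffman_codes(tokens):
    """Build a prefix code in which frequent tokens get shorter bit strings."""
    counts = Counter(tokens)
    if not counts:
        return {}
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie-breaker, partial code table).
    heap = [(n, i, {tok: ""}) for i, (tok, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {t: "0" + c for t, c in left.items()}
        merged.update({t: "1" + c for t, c in right.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(tokens, codes):
    """Concatenate the token codes and write them as bases, 2 bits per base."""
    bits = "".join(codes[t] for t in tokens)
    if len(bits) % 2:
        bits += "0"  # pad to a whole number of bases
    return "".join(BIT_PAIRS[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(dna, codes, n_tokens):
    """Invert encode; n_tokens tells the decoder where the padding starts."""
    base_bits = {base: pair for pair, base in BIT_PAIRS.items()}
    reverse = {code: tok for tok, code in codes.items()}
    out, current = [], ""
    for bit in "".join(base_bits[b] for b in dna):
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
            if len(out) == n_tokens:
                break
    return out
```

Because medical records reuse a limited vocabulary, frequent keywords receive short codes, which reduces the number of bases that must be synthesized.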
Waveform
Cardiovascular disease is now one of the main diseases threatening people's lives. An electrocardiogram (ECG) records the heart's electrical activity from the body surface and provides diagnostic information on heart disease. Clinical ECG examination is therefore significant for diagnosing heart conditions. With the popularization and development of computer technology, and in order to analyze more accurate data for clinical diagnosis, the requirements on sampling frequency, sampling accuracy, and sampling time are rising. The compression and storage of ECG signals has therefore become a significant issue.
At present, the main ECG compression methods include direct compression, feature parameter extraction, and transform compression. We designed a compression method that uses differential coding together with a variable-length coding algorithm with 2-, 4-, 8-, and 16-bit code lengths.
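One possible way to combine differential coding with 2-, 4-, 8- and 16-bit code lengths is sketched below. The exact bit layout is our assumption, not a specification: each sample is replaced by its difference from the previous sample (the first one from zero), the smallest width that can hold the signed difference is chosen, and a 2-bit tag in front of each value records which width was used. The function names are hypothetical.

```python
WIDTHS = (2, 4, 8, 16)  # allowed code lengths in bits


def encode_ecg(samples):
    """Differential coding followed by variable-length coding of the deltas."""
    bits = []
    previous = 0
    for sample in samples:
        delta = sample - previous
        previous = sample
        for tag, width in enumerate(WIDTHS):
            low, high = -(1 << (width - 1)), (1 << (width - 1)) - 1
            if low <= delta <= high:
                bits.append(format(tag, "02b"))  # 2-bit tag selects the width
                # Two's complement representation of the delta in `width` bits.
                bits.append(format(delta & ((1 << width) - 1), "0{}b".format(width)))
                break
        else:
            raise ValueError("difference does not fit in 16 bits")
    return "".join(bits)


def decode_ecg(bits):
    """Invert encode_ecg and return the original integer samples."""
    samples = []
    previous = 0
    i = 0
    while i < len(bits):
        width = WIDTHS[int(bits[i:i + 2], 2)]
        i += 2
        raw = int(bits[i:i + width], 2)
        i += width
        if raw >= 1 << (width - 1):  # restore the sign
            raw -= 1 << width
        previous += raw
        samples.append(previous)
    return samples
```

Since neighbouring ECG samples usually differ only slightly, most differences fit in the short codes and only sharp changes need 8 or 16 bits. For the six samples 1000, 1001, 1000, 1005, 1100, 900 this layout uses 60 bits instead of the 96 bits needed at a fixed 16 bits per sample.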