Team:Tsinghua-A/Engineering


Engineering Success

For every algorithm we introduced in Design, we have coded a demo and achieved our design goal.

Feature Extracting VAE

To show our idea more vividly, we designed a website that illustrates the whole pipeline through an interactive game. Any user can upload a photo that contains a human face. Once uploaded, the photo goes through the pre-trained model: it first passes through a VAE network, and the extracted features are then passed through the encoder network to generate the corresponding DNA sequence. We used the CelebA database (a large-scale face attributes dataset with more than 200K celebrity images) to generate a large DNA database. Once the DNA sequence of the photo is generated, it is matched against this database to find the most similar celebrity.
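
The final matching step can be sketched as a nearest-neighbour search over DNA strings. The similarity measure is not stated above, so this minimal sketch assumes equal-length sequences compared by Hamming distance; the function names and the toy database are hypothetical.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions where two equal-length DNA sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def most_similar(query: str, database: dict) -> str:
    """Return the name whose stored sequence is closest to the query."""
    return min(database, key=lambda name: hamming_distance(query, database[name]))


# Toy database: hypothetical names and sequences, not real CelebA encodings.
celebrity_dna = {
    "celebrity_a": "ACGTACGT",
    "celebrity_b": "ACGTTTTT",
    "celebrity_c": "GGGGCCCC",
}
best = most_similar("ACGTACTT", celebrity_dna)
```

A linear scan like this is fine for a demo; a database built from 200K images would likely need an index or a batched comparison.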

Click Start Now!

Choose a picture and upload!

Get magic!

Data Encoding Algorithms

Image

Compressing

For medical images, an image formed from 3 of the 16 steps is enough for diagnosis in most cases. Such a 3-step image takes only 10% of the storage of the raw image, which saves a lot of space.

Encoding



By combining the compression algorithm with fountain code, we established a two-dimensional image file system: one dimension indexes different images, and the other indexes different qualities.
For each image, a suitable version is selected and encoded for the DNA database, which can greatly reduce the length of sequence needed.

Text

Randomly generate medical records

Because medical record databases are private and difficult to obtain, and considering the base-pair (bp) length limitation of wet experiments, we randomly generated a simple database for the demo. After investigation, we believe that medical records currently follow written rules and language norms. We can therefore simplify a medical record into the patient's personal information plus a set of symptoms and their corresponding times (name + time + symptoms).
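
As an illustration of the simplified record form (name + time + symptoms), a random generator might look like the sketch below. The name and symptom lists, the date range, and the function names are invented for this example and are not the vocabularies used in our demo.

```python
import random
from datetime import date, timedelta

# Hypothetical vocabularies, invented for illustration only.
NAMES = ["Alice", "Bob", "Carol", "David"]
SYMPTOMS = ["fever", "cough", "headache", "nausea", "fatigue"]


def random_record(rng: random.Random) -> dict:
    """Build one record in the simplified form: name + time + symptoms."""
    day = date(2021, 1, 1) + timedelta(days=rng.randrange(365))
    symptoms = rng.sample(SYMPTOMS, k=rng.randint(1, 3))
    return {"name": rng.choice(NAMES), "time": day.isoformat(), "symptoms": symptoms}


def random_database(size: int, seed: int = 0) -> list:
    """Generate a reproducible list of random records."""
    rng = random.Random(seed)
    return [random_record(rng) for _ in range(size)]


records = random_database(5)
```

Seeding the generator keeps the demo database reproducible between runs.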

Greedy algorithm

The core is name coding and disease coding. Referring to "A Method for coding Symptoms and Signs applied to computer case management System" and "Diagnostics", we convert the number assigned to each common symptom into a fixed-length quaternary code.
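
A fixed-length quaternary code maps naturally onto the four nucleotides. The sketch below assumes the digit-to-base mapping 0→A, 1→C, 2→G, 3→T and a code length of 4; both are illustrative choices, not the values from the referenced coding tables.

```python
BASES = "ACGT"  # assumed digit-to-base mapping: 0->A, 1->C, 2->G, 3->T


def to_quaternary_dna(number: int, length: int) -> str:
    """Encode a non-negative integer as a fixed-length base-4 nucleotide string."""
    if number < 0 or number >= 4 ** length:
        raise ValueError("number does not fit in the requested length")
    digits = []
    for _ in range(length):
        number, digit = divmod(number, 4)
        digits.append(BASES[digit])
    return "".join(reversed(digits))  # most significant digit first


def from_quaternary_dna(sequence: str) -> int:
    """Decode a nucleotide string back into the integer it represents."""
    number = 0
    for base in sequence:
        number = number * 4 + BASES.index(base)
    return number


code = to_quaternary_dna(27, 4)  # 27 = 0*64 + 1*16 + 2*4 + 3 -> "ACGT"
```

With length 4, each code can distinguish 4^4 = 256 symptom numbers.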

Huffman algorithm

To handle a larger medical record database, and referring to "the Study on Chinese Character Coding with Low Redundancy TWO-DIMENSIONAL Code", we counted word frequencies and adopted the Huffman algorithm for symptom coding. Frequency statistics for letters and symptom vocabulary were computed on a small database and used to build the codes. In addition, special coding sequences are used as separators between the three key pieces of information, which also makes it possible to use PCR for keyword retrieval later.
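
To make the idea concrete, here is a minimal sketch of Huffman coding over a four-letter alphabet, so that each code word is directly a string of nucleotides. Building a quaternary tree (rather than a binary one) is an assumption made for this example, the symptom counts are made up, and the separator sequences mentioned above are not modelled.

```python
import heapq
from collections import Counter
from itertools import count

BASES = "ACGT"


def quaternary_huffman(frequencies: dict) -> dict:
    """Return a prefix-free code over A/C/G/T; frequent symbols get shorter codes."""
    if len(frequencies) == 1:
        return {symbol: BASES[0] for symbol in frequencies}
    tiebreak = count()  # unique counter so heap never compares node payloads
    heap = [(w, next(tiebreak), ("leaf", s)) for s, w in frequencies.items()]
    # Pad with zero-weight dummies so every merge can take exactly four nodes.
    while (len(heap) - 1) % 3 != 0:
        heap.append((0, next(tiebreak), ("leaf", None)))
    heapq.heapify(heap)
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(4)]
        total = sum(weight for weight, _, _ in children)
        node = ("node", [child[2] for child in children])
        heapq.heappush(heap, (total, next(tiebreak), node))
    # Walk the tree, appending one base per level.
    codes, stack = {}, [(heap[0][2], "")]
    while stack:
        (kind, payload), prefix = stack.pop()
        if kind == "node":
            for base, child in zip(BASES, payload):
                stack.append((child, prefix + base))
        elif payload is not None:  # skip dummy leaves
            codes[payload] = prefix
    return codes


# Made-up symptom counts standing in for the frequency statistics.
words = ["fever"] * 5 + ["cough"] * 3 + ["headache"] * 2
words += ["nausea", "fatigue", "rash", "dizziness"]
codes = quaternary_huffman(Counter(words))
```

Because the code is prefix-free, a concatenated sequence of code words can be decoded unambiguously from left to right.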

Waveform

It can be seen that the ECG data is relatively stable over most of each stage and changes dramatically only around the R wave. Given this characteristic of ECG signals, storing the differences between consecutive samples can save storage space effectively, so we apply the first-order difference method to the ECG data sequence.
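
The first-order difference step can be sketched as follows. The sample values are made up to mimic a flat segment with one sharp R-wave-like spike; they are not taken from our data.

```python
def first_order_difference(samples):
    """Keep the first sample, then store each sample minus its predecessor."""
    if not samples:
        return []
    return [samples[0]] + [cur - prev for prev, cur in zip(samples, samples[1:])]


def reconstruct(differences):
    """Invert the difference step with a running sum."""
    samples, total = [], 0
    for d in differences:
        total += d
        samples.append(total)
    return samples


# Made-up sample values with one sharp spike.
ecg = [1000, 1002, 1001, 1003, 1150, 1005, 1004]
diff = first_order_difference(ecg)
# diff == [1000, 2, -1, 2, 147, -145, -1]
```

Apart from the first value and the spike, the differences are small numbers, which is what makes a short variable-length code effective afterwards.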

As illustrated in the table, most of the differences fall between -8 and 7, and very few values fall outside the -128 to 127 range. Hence we use a variable-length coding algorithm with 2-, 4-, 8- and 16-bit code words, a method similar to the Huffman algorithm, to shorten the data. The data is then written to a binary file and converted into fountain code.
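
One possible way to realise a 2-4-8-16-bit variable-length code is sketched below. How the decoder learns the width of each value is not described above, so this example assumes a 2-bit tag in front of every two's-complement payload; that tag scheme and the function names are assumptions, and the binary file and fountain code steps are not shown.

```python
WIDTHS = (2, 4, 8, 16)  # payload widths; a 2-bit tag selects one (assumed scheme)


def encode(values):
    """Return a bit string: per value, a 2-bit width tag then a two's-complement payload."""
    bits = []
    for value in values:
        for tag, width in enumerate(WIDTHS):
            low, high = -(1 << (width - 1)), (1 << (width - 1)) - 1
            if low <= value <= high:
                payload = value & ((1 << width) - 1)  # two's complement in `width` bits
                bits.append(format(tag, "02b") + format(payload, "0{}b".format(width)))
                break
        else:
            raise ValueError("value does not fit in 16 bits")
    return "".join(bits)


def decode(bits):
    """Read tag/payload pairs back into signed integers."""
    values, pos = [], 0
    while pos < len(bits):
        width = WIDTHS[int(bits[pos:pos + 2], 2)]
        payload = int(bits[pos + 2:pos + 2 + width], 2)
        if payload >= 1 << (width - 1):  # sign bit set: negative value
            payload -= 1 << width
        values.append(payload)
        pos += 2 + width
    return values


differences = [0, -1, 5, -8, 100, -145, 1000]
bitstream = encode(differences)
```

Under this scheme a difference between -8 and 7 costs at most 6 bits including the tag, compared with 16 bits for a raw sample.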

Our experimental data comes from the MIT-BIH Arrhythmia Database. At present, we have coded a demo in MATLAB. The results show that our method compresses the raw data to about 20% of its original size, a smaller ratio than ZIP and RAR achieve.