We propose combining the cepstrum with nonlinear time–frequency (TF) analysis
to study multicomponent oscillatory signals with time-varying frequency and
amplitude and with time-varying non-sinusoidal oscillatory patterns. The concept of
the cepstrum is applied to eliminate the influence of the wave-shape function on the TF analysis,
and we propose a new algorithm, named the de-shape synchrosqueezing transform (de-shape
SST). A mathematical model, the adaptive non-harmonic model, is introduced,
and the de-shape SST algorithm is theoretically analyzed. In addition to simulated
signals, several different physiological, musical and biological signals are analyzed to
illustrate the proposed algorithm.
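To make the role of the cepstrum concrete, here is a minimal sketch of the textbook real cepstrum (inverse Fourier transform of the log-magnitude spectrum) applied to a non-sinusoidal periodic wave. It is not the de-shape SST algorithm, which the abstract does not spell out; the function names, the test signal, the `eps` regularizer, and the quefrency search range are all assumptions chosen for illustration.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stdlib only, for small n)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t, v in enumerate(values):
            acc += v * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(signal, eps=1e-3):
    """Inverse DFT of log(|DFT(signal)| + eps); eps (assumed) avoids log(0)."""
    spectrum = dft(signal)
    log_mag = [math.log(abs(c) + eps) for c in spectrum]
    return [c.real for c in dft(log_mag, inverse=True)]


def dominant_quefrency(cepstrum, q_min, q_max):
    """Index of the largest cepstral value within [q_min, q_max]."""
    return max(range(q_min, q_max + 1), key=lambda q: cepstrum[q])


# Illustrative signal: a fundamental with period 16 samples plus four
# harmonics of decaying amplitude, i.e. a non-sinusoidal wave shape.
N, PERIOD = 256, 16
wave = [
    sum(math.cos(2 * math.pi * h * t / PERIOD) / h for h in range(1, 6))
    for t in range(N)
]

ceps = real_cepstrum(wave)
# Skip the lowest quefrencies, which mostly reflect the spectral envelope.
period_estimate = dominant_quefrency(ceps, 2, 24)
print(period_estimate)
```

The harmonics that complicate a TF representation are evenly spaced in frequency, so they collapse into a single peak at the fundamental period in the quefrency domain. That is the property the abstract relies on when it uses the cepstrum to suppress the wave-shape influence.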
Chenglong Yu (South Australian Health and Medical Research Institute), Bernhard T. Baune (University of Adelaide), Julio Licinio (South Australian Health and Medical Research Institute), Ma-Li Wong (South Australian Health and Medical Research Institute)
Data Analysis, Bio-Statistics, Bio-Mathematics
mathscidoc:1703.42005
Major depressive disorder (MDD) is highly prevalent, resulting in an exceedingly high disease burden. The identification of generic risk factors could lead to advances in prevention and therapeutics. Current approaches examine genotyping data to identify specific variations between cases and controls. Compared to genotyping, whole-genome sequencing (WGS) allows for the detection of private mutations. In this proof-of-concept study, we establish a conceptually novel computational approach that clusters subjects based on the entirety of their WGS. Those clusters predicted MDD diagnosis. This strategy yielded encouraging results, showing that depressed Mexican-American participants were grouped closer together; in contrast, ethnically matched controls grouped away from MDD patients. This implies that, within the same ancestry, the WGS data of an individual can be used to check whether this individual is closer to MDD subjects or to controls. We propose a novel strategy to apply WGS data to clinical medicine by facilitating diagnosis through genetic clustering. Further studies utilising our method should examine larger WGS datasets on other ethnic groups.
The International Committee on Taxonomy of Viruses authorizes and organizes the taxonomic classification of viruses. Thus
far, the detailed classifications for all viruses are neither complete nor free from dispute. For example, the current missing
label rates in GenBank are 12.1% for family label and 30.0% for genus label. Using the proposed Natural Vector
representation, all 2,044 single-segment referenced viral genomes in GenBank can be embedded in R^12. Unlike other
approaches, this allows us to determine phylogenetic relations for all viruses at any level (e.g., Baltimore class, family,
subfamily, genus, and species) in real time. Additionally, the proposed graphical representation for virus phylogeny provides
a visualization of the distribution of viruses in R^12. Unlike the commonly used tree visualization methods which suffer from
uniqueness and existence problems, our representation always exists and is unique. This approach is successfully used to
predict and correct viral classification information, as well as to identify viral origins; e.g. a recent public health threat, the
West Nile virus, is closer to the Japanese encephalitis antigenic complex based on our visualization. Based on cross validation
results, the accuracy rates of our predictions are as high as 98.2% for Baltimore class labels, 96.6% for family
labels, 99.7% for subfamily labels and 97.2% for genus labels.
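As a rough illustration of how a genome can be embedded in R^12, the sketch below computes a natural vector in its commonly cited form: for each of the four nucleotides, the count, the mean position, and the normalized second central moment of the positions (4 × 3 = 12 numbers). The abstract does not state the formula, so the exact normalization, the ordering of the components, and the function name `natural_vector` are assumptions.

```python
def natural_vector(seq, alphabet="ACGT"):
    """Return [counts..., mean positions..., second moments...] for seq.

    Positions are 1-based. For a letter occurring n_k times in a sequence
    of length n, the second moment is sum((pos - mean)^2) / (n_k * n).
    Letters that do not occur contribute zeros (an assumed convention).
    """
    n = len(seq)
    counts, means, moments = [], [], []
    for letter in alphabet:
        positions = [i + 1 for i, ch in enumerate(seq) if ch == letter]
        n_k = len(positions)
        if n_k == 0:
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu = sum(positions) / n_k
        d2 = sum((p - mu) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu)
        moments.append(d2)
    return counts + means + moments


vec = natural_vector("ACGTA")
print(len(vec), vec)
```

Because every genome maps to a fixed-length vector without alignment, distances between genomes reduce to ordinary distances between points, which is what makes nearest-neighbour label prediction of the kind described above fast.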
Current methods cannot tell us concretely what the nature of the protein universe is. They are based on different models of amino acid substitution and on multiple sequence alignment, which is an NP-hard problem and requires manual intervention. Protein structural analysis also gives a direction for mapping the protein universe. Unfortunately, only a minuscule fraction of proteins' three-dimensional structures are currently known. Furthermore, the phylogenetic tree representations are not unique for any existing tree construction method. Here we develop a novel method to realize the nature of the protein universe. We show that the protein universe can be realized as a protein space in 60-dimensional Euclidean space using a distance based on a normalized distribution of amino acids. Every protein is in one-to-one correspondence with a point in protein space, where proteins with similar properties stay close together. Thus the distance between two points in protein space represents the biological distance between the corresponding two proteins. We also propose a natural graphical representation for inferring phylogenies. The representation is natural and unique, based on the biological distances of proteins in protein space. This will answer the fundamental question of how proteins are distributed in the protein universe.
The free-living SAR11 clade is a globally abundant group of oceanic Alphaproteobacteria, with small genome sizes and high genomic A+T content. However, the taxonomy of SAR11 has become controversial recently. Some researchers argue that SAR11 is a sister group to Rickettsiales. Other researchers advocate that SAR11 is located within free-living lineages of Alphaproteobacteria. Here, we use the natural vector representation method to identify the evolutionary origin of the SAR11 clade. This alignment-free method does not depend on any model assumptions. With this approach, the correspondence between proteome sequences and their natural vectors is one-to-one. After fixing a set of proteins, each bacterium is represented by a set of vectors. The Hausdorff distance is then used to compute the dissimilarity between two bacteria. The phylogenetic tree can be reconstructed based on these distances. Using our method, we systematically analyze four data sets of alphaproteobacterial proteomes in order to reconstruct the phylogeny of Alphaproteobacteria. From this we can see that the phylogenetic position of the SAR11 group is within a group of other free-living lineages of Alphaproteobacteria.
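The set-to-set comparison described above can be sketched with the standard symmetric Hausdorff distance between two finite point sets. This is a minimal illustration only: the Euclidean ground metric, the function names, and the toy two-dimensional points are assumptions, and real inputs would be the natural vectors of each bacterium's proteins.

```python
import math


def directed_hausdorff(set_a, set_b):
    """Largest distance from a point in set_a to its nearest point in set_b."""
    return max(min(math.dist(a, b) for b in set_b) for a in set_a)


def hausdorff(set_a, set_b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(set_a, set_b),
               directed_hausdorff(set_b, set_a))


# Toy stand-ins for two organisms, each represented by a set of vectors.
organism_x = [(0.0, 0.0), (1.0, 0.0)]
organism_y = [(0.0, 0.0), (4.0, 0.0)]

d_xy = hausdorff(organism_x, organism_y)
print(d_xy)  # 3.0
```

A matrix of such pairwise distances is the usual input to a distance-based tree-building method; the abstract does not say which one the authors use.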