Wegan (Web-based Ecological Group Analysis) is a free, user-friendly tool for analyzing ecological data. It supports exploratory data analysis; diversity, ordination, and taxonomic comparisons; and the generation of statistics and figures for use in publications. It is powered by R and is particularly useful for those less familiar with R programming.
Wegan was designed primarily for community abundance data, in the form of tables with sites or samples in rows and species or taxa in columns. If your data is oriented the other way (i.e., species in rows and sites in columns), indicate so during data upload. The Plotting and Correlation modules can process categorical data alongside numeric data in the same table, but all other modules require community abundance data. For sample files and format specifications, please see "Data Formats" for more details.
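To make the expected layout concrete, here is a minimal sketch in R of a community abundance table in the default orientation (the site and species names are illustrative):

```r
# A minimal community abundance table as Wegan expects it:
# sites/samples in rows, species/taxa in columns
comm <- data.frame(
  Sp_A = c(12, 0, 3),
  Sp_B = c(5, 8, 0),
  Sp_C = c(0, 2, 7),
  row.names = c("Site1", "Site2", "Site3")
)
t(comm)  # the reversed (species-in-rows) layout; flag this orientation at upload
```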
Yes. The data files you upload for analysis, as well as any analysis results, are not downloaded or examined in any way by the administrators, unless required for system maintenance and troubleshooting. All files are deleted from the server after no more than 72 hours, and no archives or backups are kept. You are advised to download your results as a zip archive immediately after performing an analysis.
Missing values should be presented either as empty cells or as NA without quotes. Any other symbol will be treated as a character string and will cause errors during data processing. Wegan offers a variety of methods to deal with missing values, which must be handled during the data upload phase; this decision was made to streamline the user interface and the available analysis methods. Be aware that by default, missing values are replaced by half of the minimum positive value detected in the data, which may or may not be appropriate for your analysis goals. Users can instead choose other methods, such as replacement by the mean or median, or imputation by probabilistic PCA (PPCA), Bayesian PCA (BPCA), or singular value decomposition (SVD) (Stacklies W. et al.).
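For reference, here is a hedged sketch of both approaches in R. The model-based imputations use the Bioconductor pcaMethods package (the Stacklies et al. reference above); `dat` is assumed to be a numeric matrix containing NAs, and the number of components is illustrative:

```r
library(pcaMethods)

# model-based imputation: method can be "ppca", "bpca", or "svdImpute"
res <- pca(dat, method = "ppca", nPcs = 3)
imputed <- completeObs(res)   # the matrix with NAs filled in

# the default behavior described above: replace NAs by
# half of the minimum positive value in the data
half_min <- min(dat[dat > 0], na.rm = TRUE) / 2
dat[is.na(dat)] <- half_min
```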
Potential outliers can be identified from PCA or PLS-DA plots. The scores plot can be used to identify sample outliers, while the loadings plot can be used to identify feature outliers. A potential outlier distinguishes itself by lying far away from the major clusters formed by the remaining samples or features. To deal with outliers, first verify that the measurements are correct. If the values cannot be corrected, the outliers should be removed from the analysis. Wegan provides the Data Editor to enable easy removal of sample or feature outliers. Please note, you may need to re-normalize the data after outlier removal.
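The same visual check can be reproduced locally with base R. A minimal sketch, assuming `comm` is a sites-by-species abundance matrix:

```r
# PCA scores plot for spotting sample outliers
pc <- prcomp(comm, scale. = TRUE)
plot(pc$x[, 1], pc$x[, 2], xlab = "PC1", ylab = "PC2")
text(pc$x[, 1], pc$x[, 2], labels = rownames(comm), pos = 3)
# samples far from the main cluster are candidate outliers;
# remove them in the Data Editor, then re-normalize
```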
This means Wegan failed to execute the command with the given parameters. Users should try adjusting the parameter values. We have found that in most cases the problem is associated with sample size. In particular, if the sample size is very small (below 10), unpredictable errors may occur. For instance, by default PCA and PLS-DA will try to generate summary/classification/permutation plots for the top five components; if the sample size is too small, this will fail.
There are several unsupervised methods (PCA, hierarchical clustering, SOM, K-means) that can be used to detect inherent patterns in unlabeled data. However, you need to trick Wegan into accepting the data by providing dummy two-group labels, as sketched below. In this case, results from feature selection or supervised classification methods will be meaningless.
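A minimal sketch of preparing such a file in R; the column name "Group" and the label values are illustrative, not Wegan's required format:

```r
# attach alternating dummy two-group labels to unlabeled data
dummy <- rep(c("A", "B"), length.out = nrow(comm))
upload <- data.frame(Group = dummy, comm)
write.csv(upload, "unlabeled_with_dummy_groups.csv")
# only the unsupervised results (PCA, clustering, SOM, K-means)
# are meaningful after this trick
```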
The generalized logarithm (glog) is a simple variant of the ordinary logarithm designed to handle zero or negative values in the data set. It has many desirable features (for details, see Durbin BP. et al.). Its formula is:

glog(x) = log((x + sqrt(x^2 + a^2)) / 2)

where a is a constant with a default value of 1.
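A direct R translation of the formula above:

```r
glog <- function(x, a = 1) log((x + sqrt(x^2 + a^2)) / 2)

glog(0)    # defined at zero, unlike log()
glog(-2)   # handles negative values as well
```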
The purposes of the Data Editor and Data Filter are to improve data quality for better separation, prediction, or interpretation. In particular, users can use the Data Editor to remove outliers (which can be identified visually from PCA or PLS-DA scores plots), and the Data Filter to remove noisy or uninformative features (i.e., baseline noise, near-constant features). Such features tend to dilute the signal and degrade the performance of most statistical procedures. By removing outliers and low-quality features, the resulting data will be more consistent and reliable.
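One common filtering criterion is variability: near-constant features have little information to contribute. A minimal sketch in R (the 10% cutoff is illustrative):

```r
# drop the least variable features, judged by interquartile range
iqrs <- apply(comm, 2, IQR)
keep <- iqrs > quantile(iqrs, 0.10)     # discard the bottom 10%
filtered <- comm[, keep, drop = FALSE]
```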
In a boxplot, the bottom and top of the box are always the 25th and 75th percentiles (the lower and upper quartiles, or Q1 and Q3, respectively), and the band near the middle of the box is always the 50th percentile (the median, or Q2). The upper whisker is located at the smaller of the maximum x value and Q3 + 1.5*IQR (interquartile range), whereas the lower whisker is located at the larger of the minimum x value and Q1 - 1.5*IQR.
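This rule can be verified directly in base R, which exposes the five statistics drawn by boxplot():

```r
x <- c(2, 3, 4, 5, 6, 7, 30)   # 30 lies beyond Q3 + 1.5*IQR
boxplot.stats(x)$stats  # lower whisker, Q1, median, Q3, upper whisker
boxplot.stats(x)$out    # points flagged as outliers (here: 30)
```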
The purpose of fold change (FC) analysis is to compare the absolute change between two group averages. In paired fold change analysis, users aim to find features that change consistently (i.e., up-regulated or down-regulated, but not both) between two groups. Consistency is measured as a percentage: (# of pairs with consistent change above a given FC threshold) / (total # of sample pairs). Users need to specify two thresholds: the fold change threshold and the significance count (the percentage).
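A minimal sketch of this computation in R, under the assumption that `x1` and `x2` are matrices whose rows are paired samples from the two groups and whose columns are features; the thresholds are the two user-specified values described above:

```r
paired_fc_consistent <- function(x1, x2, fc_thresh = 2, count_thresh = 0.75) {
  fc <- x2 / x1                              # per-pair fold changes
  up   <- colMeans(fc >= fc_thresh)          # fraction of pairs up-regulated
  down <- colMeans(fc <= 1 / fc_thresh)      # fraction down-regulated
  # consistent in ONE direction only, above the count threshold
  up >= count_thresh | down >= count_thresh
}
```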
There are two variable importance measures in PLS-DA. The first, Variable Importance in Projection (VIP), is a weighted sum of squares of the PLS loadings that takes into account the amount of explained Y-variation in each dimension. Please note, since VIP scores are calculated for each component, when more than one component is used to calculate feature importance, the average of the VIP scores is used. The other importance measure is based on the weighted sum of the PLS regression coefficients. The weights are a function of the reduction of the sums of squares across the number of PLS components. Please note, for multiple-group (more than two) analyses, the same number of predictors will be built for each group; therefore, the coefficient of each feature will differ depending on which group you want to predict. The average of the feature coefficients is used as the overall coefficient-based importance.
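For the curious, here is a minimal sketch of the textbook VIP computation, assuming a single-response PLS model fitted with plsr() from the pls package; this illustrates the formula, not necessarily Wegan's exact implementation:

```r
library(pls)

vip <- function(fit) {
  W  <- fit$loading.weights           # p x H matrix of PLS weights
  q  <- c(fit$Yloadings)              # Y loadings, one per component
  SS <- q^2 * colSums(fit$scores^2)   # explained Y-variance per component
  Wn <- sweep(W, 2, sqrt(colSums(W^2)), "/")  # normalize weight vectors
  # VIP_j = sqrt( p * sum_h SS_h * w_jh^2 / sum_h SS_h )
  sqrt(nrow(W) * rowSums(sweep(Wn^2, 2, SS, "*")) / sum(SS))
}
```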
It is well known that when there are too many variables and a small sample size, many supervised classification algorithms tend to overfit the data. That is, even if there is no actual difference between the groups, the program will still be able to discriminate them by picking features that differ between the groups by pure chance! Of course, such a classifier will be useless for new data, since the pattern it detected is not real (not significant).
The purpose of a permutation test is to answer the question: "what would the performance be if the groups were formed randomly?" The program uses the same data set with its class labels reassigned at random, builds a new classifier, and evaluates its performance. The process is repeated many times to estimate the distribution of the performance measure (which does not necessarily follow a normal distribution). By comparing the performance based on the original labels with the performance based on the randomly labeled data, one can see whether the former is significantly different from the latter. For PLS-DA, performance is measured using prediction accuracy or group separation distance based on the "B/W ratio" (the ratio of between-group to within-group sums of squares, as suggested by Bijlsma et al.). The further the observed value lies to the right of the distribution formed by the randomly permuted data, the more significant the discrimination. The p-value is calculated as the proportion of times that class separation based on randomly labeled samples is at least as good as that based on the original data (a one-sided p-value).
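A minimal sketch of the whole procedure in R; `bw_ratio` is an illustrative between/within sum-of-squares statistic (assuming `X` is a numeric matrix and `y` a factor of class labels), not Wegan's exact code:

```r
bw_ratio <- function(X, y) {
  gm <- colMeans(X)
  B <- 0; W <- 0
  for (g in levels(y)) {
    Xg <- X[y == g, , drop = FALSE]
    cg <- colMeans(Xg)
    B <- B + nrow(Xg) * sum((cg - gm)^2)   # between-group sum of squares
    W <- W + sum(sweep(Xg, 2, cg)^2)       # within-group sum of squares
  }
  B / W
}

observed <- bw_ratio(X, y)
perm <- replicate(1000, bw_ratio(X, sample(y)))   # shuffle the labels
# one-sided p: how often random labeling separates at least as well
p <- (sum(perm >= observed) + 1) / (length(perm) + 1)
```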
Different methods use different criteria for ranking features, and the choice of method can greatly affect the set of features identified. This can be illustrated in simple cases such as fold change vs. t-tests: the former is concerned with the absolute change in concentrations/intensities, while the latter focuses on changes relative to the underlying noise. PLS-DA is based on linear regression; SAM and EBAM usually produce results similar to t-tests, since all are based on t statistics; random forests and SVM are quite distinct from the other methods, using an ensemble of classification trees or projections to hyperplanes, respectively. These methods therefore tend to generate different results.
This is the workflow of the R-SVM algorithm described by Zhang X, et al. Recursive SVM uses SVM both for classification and for selecting a subset of relevant features according to their relative contributions to the classification. This is done recursively, so that a series of data subsets and classification models is obtained at different levels of feature selection. Classification performance can be evaluated either on an independent test data set or by cross-validation on the same data set. R-SVM also includes an option for permutation experiments to assess the significance of the performance. Please note, only a linear kernel is used for classification, since the information is usually far from sufficient for reliably estimating nonlinear relations in high-dimensional data with a small sample size. The first-order approximation reduces the risk of overfitting when data are limited.
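A simplified sketch in the spirit of this recursion, using the e1071 package. Note one substitution: this version ranks features by the magnitude of the linear SVM weight vector (as in SVM-RFE), whereas R-SVM proper weights each feature's contribution by the class-mean difference; the feature-count levels are illustrative:

```r
library(e1071)

recursive_svm <- function(X, y, levels = c(32, 16, 8, 4)) {
  feats <- colnames(X)   # y must be a two-level factor
  subsets <- list()
  for (k in levels) {
    fit <- svm(X[, feats, drop = FALSE], y, kernel = "linear")
    # linear SVM weight vector: w = sum_i alpha_i * y_i * x_i
    w <- colSums(fit$coefs[, 1] * fit$SV)
    # keep the k features with the largest |w|, then recurse
    feats <- feats[order(abs(w), decreasing = TRUE)[seq_len(min(k, length(feats)))]]
    subsets[[as.character(k)]] <- feats
  }
  subsets   # nested feature subsets, one classifier per selection level
}
```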
CV2 refers to cross-validation with the feature selection procedure embedded, as discussed above. Caution must be taken in practice. If features are selected based on background biological knowledge, it is fine to use the selected subset of the data for classification. However, it is not proper to use features selected from the whole dataset with a supervised method (one that uses the class label information, such as t-tests, SAM, PLS-DA, or other supervised classification methods). Doing so introduces selection bias and yields a very optimistic cross-validation performance due to "information leak". To obtain an objective performance evaluation, one should include the feature selection procedure inside the cross-validation loop, as sketched below. Alternatively, one can evaluate the classifier on an independent dataset not used in feature selection. These procedures are used by PLS-DA, random forest, and R-SVM for feature selection and classification. Currently, Wegan does not support feature selection and classification using different algorithms (for example, selecting features with one method and then using PLS-DA for classification); this is planned as our next step. Please refer to the paper by Ambroise C and McLachlan GJ for a more detailed discussion.
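A minimal sketch of the correct nesting in R, assuming `X` is a numeric matrix and `y` a two-level factor; the t-test filter, the 20-feature cutoff, and the LDA classifier are all illustrative stand-ins:

```r
library(MASS)

set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(X)))
acc <- numeric(k)
for (i in 1:k) {
  train <- folds != i
  # select features using the training fold ONLY: no information leak
  pvals <- apply(X[train, ], 2, function(v) t.test(v ~ y[train])$p.value)
  sel <- order(pvals)[1:20]
  fit <- lda(X[train, sel], y[train])
  pred <- predict(fit, X[!train, sel])$class
  acc[i] <- mean(pred == y[!train])
}
mean(acc)   # an honest estimate; selecting on all of X first would inflate it
```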
As a safe alternative, one can use non-specific filtering (i.e., filtering that does not use the class labels) to select features. We recently introduced the Data Filter function under the Data Process category; users can use it to remove low-quality features based on a variety of criteria. Additionally, users can try to remove sample outliers, if any exist, using the Data Editor. These are safe procedures that can potentially improve classification performance.