Changes in version 2.1.0

Breaking Changes

- create_diagnostic_plots() renamed to plot_mcmc_diagnostics(), and its parameter res renamed to dpi, for naming consistency

New Features

Subsampling for Computational Efficiency

- Major Feature: Added the optional opt_subsample parameter to key optimization functions, enabling efficient parameter optimization on large datasets while maintaining final embedding quality. Parameter optimization remains reliable under subsampling because likelihoods computed on samples of the same size are comparable, so the optimal parameter values can still be identified.
- New Functions:
  - check_matrix_connectivity(): Validates that a dissimilarity matrix forms a connected graph
  - subsample_dissimilarity_matrix(): Creates random subsamples with automatic connectivity validation and adaptive size adjustment
  - sanity_check_subsample(): Validates subsample suitability for cross-validation
  - prune_sparse_matrix(): Prunes sparse dissimilarity matrices to a well-connected subset
- Enhanced Functions:
  - initial_parameter_optimization(): Now accepts opt_subsample parameter
  - run_adaptive_sampling(): Now accepts opt_subsample parameter
  - adaptive_MC_sampling(): Now accepts opt_subsample parameter (internal)
  - Euclidify(): Now accepts opt_subsample parameter

How Subsampling Works

When opt_subsample is specified:

1. Each parameter evaluation uses a random subsample of the specified size
2. Connectivity is automatically validated; disconnected subsamples are rejected
3. If connectivity validation fails, the sample size needs to be increased
4. Different parameter evaluations use different subsamples for robustness
5. Final embedding always uses the full dataset

The opt_subsample parameter is optional (default: NULL = use full data).

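The subsampling steps above can be illustrated with a minimal base-R sketch. It does not call subsample_dissimilarity_matrix() or check_matrix_connectivity(); the helpers is_connected() and draw_connected_subsample() are hypothetical names introduced only for this illustration, and the fixed retry limit is an assumption rather than the package's actual policy.

```r
# Treat the matrix as a graph: points are nodes, measured (non-NA)
# off-diagonal entries are edges. Hypothetical helper, base R only.
is_connected <- function(d) {
  n <- nrow(d)
  adj <- !is.na(d)
  diag(adj) <- FALSE
  adj <- adj | t(adj)            # a pair measured in either direction counts
  visited <- rep(FALSE, n)
  visited[1L] <- TRUE
  frontier <- 1L
  while (length(frontier) > 0L) {
    # unvisited neighbours of the current frontier (breadth-first search)
    nb <- which(colSums(adj[frontier, , drop = FALSE]) > 0 & !visited)
    visited[nb] <- TRUE
    frontier <- nb
  }
  all(visited)
}

# Draw random subsamples of a fixed size and reject disconnected ones.
# max_tries is an assumed safeguard, not a documented package parameter.
draw_connected_subsample <- function(d, size, max_tries = 20L) {
  for (i in seq_len(max_tries)) {
    idx <- sort(sample(nrow(d), size))
    sub <- d[idx, idx, drop = FALSE]
    if (is_connected(sub)) return(sub)
  }
  stop("No connected subsample found; increase the subsample size.")
}

# Toy data: 30 points in 2D with roughly 60% of the pairs unmeasured.
set.seed(1)
n <- 30
d <- as.matrix(dist(matrix(rnorm(n * 2), ncol = 2)))
mask <- matrix(FALSE, n, n)
mask[upper.tri(mask)] <- runif(n * (n - 1) / 2) < 0.6
mask <- mask | t(mask)           # hide pairs symmetrically
d[mask] <- NA

sub <- draw_connected_subsample(d, size = 15)
dim(sub)
```

In this sketch each call draws a fresh subsample, which mirrors the idea that different parameter evaluations see different subsets of the data.
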
Performance Benefits

- Speeds up parameter optimization by ~10-50x on large datasets (>500 points)
- Reduces memory usage in proportion to subsample size
- Parameters found on subsamples generalize well to full data

Other changes

- Package gridExtra is now a required import.

Recommendations

- Datasets < 500 points: Use full data (opt_subsample = NULL)
- Datasets > 500 points: Recommended opt_subsample = 200-500
- Always ensure opt_subsample >= folds for reliable cross-validation

Bug Fixes

- Conversion of matrices to numeric in "R/adaptive_sampling.R" is now properly handled by the extract_numeric_values() function of the Topolow package.

Improvements

- Enhanced connectivity checking using igraph
- Better error messages for disconnected data
- Adaptive strategies for handling sparse data
- Comprehensive logging of subsampling operations
- New diagnostic plots including MCMC exploration and parameter fit traces

New changes towards v3:

#' The function performs these steps in an epoch-based evolutionary strategy:
#' 1. Initialization: Starts with the user-provided parameter ranges.
#' 2. Epoch Loop: For each epoch:
#'    a. Generates num_samples using LHS within the current parameter ranges.
#'    b. If opt_subsample is specified, each evaluation uses a random subsample.
#'    c. Evaluates parameter sets via cross-validation (in parallel batches).
#'    d. Range Update (after all but the final epoch):
#'       - Sorts results by NLL and keeps the top 50%.
#'       - Updates parameter ranges for the next epoch based on survivors:
#'         New Min = 0.75 * Min(Survivors), New Max = 1.25 * Max(Survivors).
#'       - This allows the search to drift and zoom in on optimal regions.
#' 3. Finalization: Automatically log-transforms the results from the final epoch
#'    for direct use with adaptive sampling.
#' @param epochs Integer. Number of optimization epochs. In each epoch, parameters are sampled,
#'   evaluated, and the best 50% are used to refine the search space for the next epoch.
#'   Default: 3.

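The epoch loop described in the notes above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the parameter names a and b, the toy_nll() objective (standing in for the cross-validated NLL), and the hand-rolled lhs_sample() helper are all hypothetical, and the final log-transformation step is omitted.

```r
# Simple Latin hypercube sample: one point per stratum in each dimension.
# ranges is a named list of c(min, max) pairs. Hypothetical helper.
lhs_sample <- function(n, ranges) {
  out <- lapply(ranges, function(r) {
    u <- (sample(n) - runif(n)) / n     # one uniform draw per stratum
    r[1] + u * (r[2] - r[1])
  })
  as.data.frame(out)
}

# Toy stand-in for the cross-validated negative log-likelihood.
toy_nll <- function(a, b) {
  (log(a) - log(5))^2 + (log(b) - log(0.01))^2
}

ranges <- list(a = c(1, 30), b = c(1e-4, 0.1))
epochs <- 3
num_samples <- 40
set.seed(42)

for (epoch in seq_len(epochs)) {
  samples <- lhs_sample(num_samples, ranges)
  samples$NLL <- toy_nll(samples$a, samples$b)
  if (epoch < epochs) {
    # Keep the best 50% by NLL, then rebuild the ranges from the survivors.
    survivors <- samples[order(samples$NLL), ][seq_len(num_samples / 2), ]
    for (p in names(ranges)) {
      ranges[[p]] <- c(0.75 * min(survivors[[p]]),
                       1.25 * max(survivors[[p]]))
    }
  }
}

ranges   # ranges used for the final epoch
```

Because the new bounds are scaled by 0.75 and 1.25 rather than clipped to the previous range, the search window can both shrink around good regions and move past the initial limits when the best samples sit at an edge.
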
C++ Backend for Core Optimization (Performance)

- Major Performance Enhancement: The core optimization loop in euclidean_embedding() has been rewritten in C++ using Rcpp and RcppArmadillo, providing significant speedups for large datasets. All for loops in the core function euclidean_embedding() have been replaced with vector operations.
- New Algorithm: Negative Sampling (not used)
  - Implements negative sampling to approximate unmeasured pair repulsion
  - Reduces complexity from O(N²) per iteration to O(E × k), where E is the number of measured edges and k is the number of negative samples
  - New parameter n_negative_samples (default: 5) controls the approximation quality vs. speed tradeoff
  - Particularly beneficial for sparse matrices (>90% missing values)
- New Parameters for euclidean_embedding():
  - n_negative_samples: Number of negative samples per edge endpoint (default: 5). Higher values better approximate the original O(N²) algorithm but increase computation time.
  - convergence_check_freq: How often to check for convergence in iterations (default: 10). Lower values give more precise stopping but add overhead.
- Implementation Details:
  - COO Format: Uses Coordinate List format for edge data to avoid sparse matrix zero-dropping issues
  - Edge Shuffling: C++ native random number generator (std::mt19937) for stochastic edge ordering, critical for escaping local optima
  - Immediate Updates: Preserves Gauss-Seidel style position updates from the original R implementation for identical convergence behavior
  - Vectorized Error Calculation: Uses Armadillo batch operations for computing MAE during convergence checks
  - Cache-Friendly Layout: Edge data stored in contiguous arrays for better CPU cache utilization
  - Pre-computed Factors: Degree-based normalization factors computed once before optimization
  - Direct Memory Access: Bypasses Armadillo accessors for position updates in the inner loop
- Return Value Enhancement: The convergence field in the returned topolow object now includes:
  - achieved: Boolean indicating whether convergence was reached
  - error: Final MAE on active constraints
  - final_k: Final spring constant value after cooling
- Dependencies: Added RcppArmadillo to LinkingTo (compile-time only, no runtime dependency added)

Performance Comparison

| Dataset Size | Sparsity | R (v2.0) | C++ (v2.1) | Speedup |
|--------------|----------|----------|------------|---------|
| 100 points   | 50%      | ~2s      | ~0.3s      | ~7×     |
| 500 points   | 80%      | ~45s     | ~4s        | ~11×    |
| 1000 points  | 95%      | ~180s    | ~12s       | ~15×    |

Benchmarks on 1000 iterations, 3 dimensions. Actual speedup varies with data characteristics.

Backward Compatibility

- All existing code using euclidean_embedding() will work without modification
- Default parameter values preserve original algorithm behavior
- Output structure remains compatible with downstream functions (Euclidify(), parameter optimization, etc.)

Changes in version 2.0.1 (2025-08-30)

Included figures in the vignette.

Changes in version 2.0.0 (2025-08-19)

The wizard function Euclidify was added to run all the workflow needed to get the main output automatically.

Deprecations

- create_topolow_map() is now deprecated in favor of euclidean_embedding(). The old function will be removed in version 3.0.0.
  - Parameter name changed: distance_matrix --> dissimilarity_matrix
  - Function name changed: create_topolow_map() --> euclidean_embedding()

Breaking Changes

- initial_parameter_optimization(): Parameter distance_matrix renamed to dissimilarity_matrix
  - Migration: Replace distance_matrix = your_matrix with dissimilarity_matrix = your_matrix
- run_adaptive_sampling(): Parameter distance_matrix renamed to dissimilarity_matrix
  - Migration: Replace distance_matrix = your_matrix with dissimilarity_matrix = your_matrix
- adaptive_MC_sampling():
  - Parameter distance_matrix renamed to dissimilarity_matrix
  - Removed parameter batch_size from adaptive_MC_sampling(); its value had no effect
  - Removed parameter num_parallel_jobs from run_adaptive_sampling(); set max_cores to define the number of cores and parallel jobs
  - Migration: Replace distance_matrix = your_matrix with dissimilarity_matrix = your_matrix and remove batch_size arguments
- create_cv_folds(): Parameter names and return structure changed
  - Parameter changes: truth_matrix --> dissimilarity_matrix, no_noise_truth --> ground_truth_matrix
  - Return structure: Now returns named list elements ($truth, $train) instead of indexed elements
  - Migration: Update parameter names and change result[[1]][[1]] to result[[1]]$truth, result[[1]][[2]] to result[[1]]$train
- take_log parameter in clean_data() is deprecated
  - Perform log transformation before calling this function instead
  - Parameter will be removed in the next major version
- analyze_network_structure(): Parameter distance_matrix renamed to dissimilarity_matrix for consistency with other functions
- calculate_diagnostics(): Return class changed from topolow_amcs_diagnostics to topolow_diagnostics for naming consistency
- plot_network_structure(): Removed aesthetic_config and layout_config parameters
  - Migration: Replace with width, height, dpi parameters
  - Fixed aesthetic values improve consistency but reduce customization
  - Added better handling for empty network cases
- scatterplot_fitted_vs_true(): Parameter names updated for consistency
  - Migration: distance_matrix --> dissimilarity_matrix, p_dist_mat --> p_dissimilarity_mat
  - Migration: Default save_plot changed from TRUE to FALSE
  - Improved modern ggplot2 syntax using linewidth instead of deprecated size
- error_calculator_comparison(): Parameter names changed for consistency
  - p_dist_mat --> predicted_dissimilarities
  - truth_matrix --> true_dissimilarities
  - input_matrix --> input_dissimilarities (now optional, defaults to NULL)
  - Migration: Update all parameter names in function calls
- calculate_prediction_interval(): Parameter names changed for consistency
  - distance_matrix --> dissimilarity_matrix
  - p_dist_mat --> predicted_dissimilarity_matrix
  - Migration: Update parameter names in function calls
- long_to_matrix was renamed to titers_list_to_matrix since it is specific to viral titer data processing.
- Function process_antigenic_data now accepts a data frame as input instead of a file path.
- In process_antigenic_data, is_titer was renamed to is_similarity for clarity to a broader audience. Parameter id_prefix was removed.

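For the create_cv_folds() return-structure change listed above, the following sketch contrasts the two access patterns. It does not call create_cv_folds(); old_result and new_result are hand-built mock objects that assume the shape described in the migration note (a list of folds, each holding a truth matrix and a training matrix).

```r
# Mock fold contents: a tiny truth matrix and a training copy with one
# held-out pair set to NA. These are stand-ins, not package output.
truth <- matrix(c(0, 2, 2, 0), nrow = 2)
train <- truth
train[1, 2] <- train[2, 1] <- NA

old_result <- list(list(truth, train))                  # positional elements
new_result <- list(list(truth = truth, train = train))  # named elements

# Old access pattern (before 2.0.0)
old_truth <- old_result[[1]][[1]]
old_train <- old_result[[1]][[2]]

# New access pattern (2.0.0 and later)
new_truth <- new_result[[1]]$truth
new_train <- new_result[[1]]$train
```

Named access makes downstream code independent of element order, which is presumably the motivation for the change.
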
New Features

- Added euclidean_embedding() function with enhanced performance and features:
  - Matrix reordering: Automatic spectral ordering concentrates largest dissimilarity values in corners for improved optimization
  - Enhanced validation: Better input data quality checks with informative warnings
  - Improved documentation: More detailed examples and parameter guidance

Improvements

- Package dependencies were reduced from 20 to 13
- Enhanced algorithm documentation with clearer physics-inspired approach description
- Better handling of edge cases in dissimilarity matrix processing
- Improved error messages for parameter validation
- Updated parameter_sensitivity function to use modern ggplot2 syntax
- Improved input validation and error handling in sensitivity analysis
- Enhanced MLE calculation algorithm for better robustness
- Replaced deprecated size parameter with linewidth in plots
- Enhanced input validation and error messages in create_cv_folds()
- input_dissimilarities parameter now optional in error_calculator_comparison()
- initial_parameter_optimization saves/returns the parameters in log scale, consistent with other functions
- A vignette was added

Deprecation Timeline

- Version 2.0.0: create_topolow_map() deprecated, issues warning
- Version 3.0.0 (planned): create_topolow_map() will be removed

Migration Guide

To update your code:

    # Old (deprecated):
    result <- create_topolow_map(distance_matrix = my_matrix,
                                 # ... other parameters
                                 )

    # New (recommended):
    result <- euclidean_embedding(dissimilarity_matrix = my_matrix,  # parameter name changed
                                  # ... other parameters (unchanged)
                                  )

Changes in version 1.0.0 (2025-07-11)

- All exported methods now include \value documentation describing the output's class, structure, and meaning.
- Examples for unexported functions have been omitted, and \dontrun{} wrappers have been removed. Slower examples are now wrapped in \donttest{} as appropriate.
- Functions no longer write to user directories by default. Functions where writing a file is the main purpose now require the user to specify an output directory.
- The complex distributed processing functionality has been removed, as it was not essential for typical use cases.
- The link to our paper and citation information have been updated.

Changes in version 0.3.2

- Initial release to CRAN (revised per CRAN reviewer's instructions).
- Introduces the Topolow algorithm, a physics-inspired method for antigenic cartography.
- Provides robust mapping and complete positioning of all antigens, even with highly sparse datasets (>95% missing values).
- Implements automatic, likelihood-based estimation to determine the optimal dimensionality of the antigenic map.
- Includes functionality to calculate "antigenic velocity" vectors to quantify the rate and direction of antigenic drift.
- Features tools for handling and processing cross-reactivity and binding affinity assay data, including those with thresholded values.
- Demonstrates improved prediction accuracy and run-to-run stability compared to traditional MDS methods.