NEWS

topolow 2.1.0

Breaking Changes

create_diagnostic_plots() renamed to plot_mcmc_diagnostics() and its parameter res renamed to dpi for naming consistency

New Features

Subsampling for Computational Efficiency

Major Feature: Added the optional opt_subsample parameter to key optimization functions, enabling efficient parameter optimization on large datasets while maintaining final embedding quality. Parameter optimization still works reliably with subsampling, because likelihoods of samples of the same size are comparable, allowing us to choose the optimal parameter values.
New Functions:
- check_matrix_connectivity(): Validates that a dissimilarity matrix forms a connected graph
- subsample_dissimilarity_matrix(): Creates random subsamples with automatic connectivity validation and adaptive size adjustment
- sanity_check_subsample(): Validates subsample suitability for cross-validation
- prune_sparse_matrix(): Prunes sparse dissimilarity matrices to a well-connected subset
Enhanced Functions:
- initial_parameter_optimization(): Now accepts opt_subsample parameter
- run_adaptive_sampling(): Now accepts opt_subsample parameter
- adaptive_MC_sampling(): Now accepts opt_subsample parameter (internal)
- Euclidify(): Now accepts opt_subsample parameter

How Subsampling Works

When opt_subsample is specified:

Each parameter evaluation uses a random subsample of the specified size
Connectivity is automatically validated; disconnected subsamples are rejected
If connectivity fails, sample size needs to be increased
Different parameter evaluations use different subsamples for robustness
Final embedding always uses the full dataset

The opt_subsample parameter is optional (default: NULL = use full data).

Performance Benefits

Speeds up parameter optimization by ~10-50x on large datasets (>500 points)
Reduces memory usage proportional to subsample size
Parameters found on subsamples generalize well to full data

Other changes

Package gridExtra is a required import now.

Recommendations

Datasets < 500 points: Use full data (opt_subsample = NULL)
Datasets > 500 points: Recommended opt_subsample = 200-500
Always ensure opt_subsample >= folds for reliable cross-validation

Bug Fixes

Conversion of matrices to numeric in "R/adaptive_sampling.R" are now properly handled by extract_numeric_values() function of the Topolow package.

Improvements

Enhanced connectivity checking using igraph
Better error messages for disconnected data
Adaptive strategies for handling sparse data
Comprehensive logging of subsampling operations
New diagnostic plots including MCMC exploration and parameter fit traces

New changes towards v3:

#' The function performs these steps in an epoch-based evolutionary strategy: #' 1. Initialization: Starts with the user-provided parameter ranges. #' 2. Epoch Loop: For each epoch: #' a. Generates num_samples using LHS within the current parameter ranges. #' b. If opt_subsample is specified, each evaluation uses a random subsample. #' c. Evaluates parameter sets via cross-validation (in parallel batches). #' d. Range Update (after all but the final epoch): #' - Sorts results by NLL and keeps the top 50%. #' - Updates parameter ranges for the next epoch based on survivors: #' New Min = 0.75 * Min(Survivors), New Max = 1.25 * Max(Survivors). #' - This allows the search to drift and zoom in on optimal regions. #' 3. Finalization: Automatically log-transforms the results from the final epoch #' for direct use with adaptive sampling.

#' @param epochs Integer. Number of optimization epochs. In each epoch, parameters are sampled, #' evaluated, and the best 50% are used to refine the search space for the next epoch. #' Default: 3.

C++ Backend for Core Optimization (Performance)

Major Performance Enhancement: The core optimization loop in euclidean_embedding() has been rewritten in C++ using Rcpp and RcppArmadillo, providing significant speedups for large datasets. All for loops in the core function euclidean_embedding() have been replaced with vector operations.
New Algorithm: Negative Sampling (not used)
- Implements negative sampling to approximate unmeasured pair repulsion
- Reduces complexity from O(N²) per iteration to O(E × k), where E is the number of measured edges and k is the number of negative samples
- New parameter n_negative_samples (default: 5) controls the approximation quality vs. speed tradeoff
- Particularly beneficial for sparse matrices (>90% missing values)
New Parameters for euclidean_embedding():
- n_negative_samples: Number of negative samples per edge endpoint (default: 5). Higher values better approximate the original O(N²) algorithm but increase computation time.
- convergence_check_freq: How often to check for convergence in iterations (default: 10). Lower values give more precise stopping but add overhead.
Implementation Details:
- COO Format: Uses Coordinate List format for edge data to avoid sparse matrix zero-dropping issues
- Edge Shuffling: C++ native random number generator (std::mt19937) for stochastic edge ordering, critical for escaping local optima
- Immediate Updates: Preserves Gauss-Seidel style position updates from the original R implementation for identical convergence behavior
- Vectorized Error Calculation: Uses Armadillo batch operations for computing MAE during convergence checks
- Cache-Friendly Layout: Edge data stored in contiguous arrays for better CPU cache utilization
- Pre-computed Factors: Degree-based normalization factors computed once before optimization
- Direct Memory Access: Bypasses Armadillo accessors for position updates in the inner loop
Return Value Enhancement: The convergence field in the returned topolow object now includes:
- achieved: Boolean indicating whether convergence was reached
- error: Final MAE on active constraints
- final_k: Final spring constant value after cooling
Dependencies: Added RcppArmadillo to LinkingTo (compile-time only, no runtime dependency added)

Performance Comparison

| Dataset Size | Sparsity | R (v2.0) | C++ (v2.1) | Speedup | |--------------|----------|----------|------------|---------| | 100 points | 50% | ~2s | ~0.3s | ~7× | | 500 points | 80% | ~45s | ~4s | ~11× | | 1000 points | 95% | ~180s | ~12s | ~15× |

Benchmarks on 1000 iterations, 3 dimensions. Actual speedup varies with data characteristics.

Backward Compatibility

All existing code using euclidean_embedding() will work without modification
Default parameter values preserve original algorithm behavior
Output structure remains compatible with downstream functions (Euclidify(), parameter optimization, etc.)

topolow 2.0.1 (2025-08-30)

Included figures in the vignette.

topolow 2.0.0 (2025-08-19)

The wizard function Euclidify was added to run all the workflow needed to get the main output automatically.

Deprecations

create_topolow_map() is now deprecated in favor of euclidean_embedding(). The old function will be removed in version 3.0.0.
- Parameter name changed: distance_matrix --> dissimilarity_matrix
- Function name changed: create_topolow_map() --> euclidean_embedding()

Breaking Changes

initial_parameter_optimization(): Parameter distance_matrix renamed to dissimilarity_matrix
- Migration: Replace distance_matrix = your_matrix with dissimilarity_matrix = your_matrix
run_adaptive_sampling(): Parameter distance_matrix renamed to dissimilarity_matrix
- Migration: Replace distance_matrix = your_matrix with dissimilarity_matrix = your_matrix
adaptive_MC_sampling():
- Parameter distance_matrix renamed to dissimilarity_matrix
- Removed parameter batch_size from adaptive_MC_sampling(); its value had no effect in the processes anyway
- Removed parameter num_parallel_jobs from run_adaptive_sampling; set max_cores to define the number of cores and parallel jobs
- Migration: Replace distance_matrix = your_matrix with dissimilarity_matrix = your_matrix and remove batch_size arguments
create_cv_folds(): Parameter names and return structure changed
- Parameter changes: truth_matrix --> dissimilarity_matrix, no_noise_truth --> ground_truth_matrix
- Return structure: Now returns named list elements ($truth, $train) instead of indexed elements
- Migration: Update parameter names and change result[[1]][[1]] to result[[1]]$truth, result[[1]][[2]] to result[[1]]$train
take_log parameter in clean_data() is deprecated
- Perform log transformation before calling these functions instead
- Parameter will be removed in next major version
analyze_network_structure(): Parameter distance_matrix renamed to dissimilarity_matrix for consistency with other functions
calculate_diagnostics(): Return class changed from topolow_amcs_diagnostics to topolow_diagnostics for naming consistency
plot_network_structure(): Removed aesthetic_config and layout_config parameters
- Migration: Replace with width, height, dpi parameters
- Fixed aesthetic values improve consistency but reduce customization
- Added better handling for empty network cases
scatterplot_fitted_vs_true(): Parameter names updated for consistency
- Migration: distance_matrix --> dissimilarity_matrix, p_dist_mat --> p_dissimilarity_mat
- Migration: Default save_plot changed from TRUE to FALSE
- Improved modern ggplot2 syntax using linewidth instead of deprecated size
error_calculator_comparison(): Parameter names changed for consistency
- p_dist_mat --> predicted_dissimilarities
- truth_matrix --> true_dissimilarities
- input_matrix --> input_dissimilarities (now optional, defaults to NULL)
- Migration: Update all parameter names in function calls
calculate_prediction_interval(): Parameter names changed for consistency
- distance_matrix --> dissimilarity_matrix
- p_dist_mat --> predicted_dissimilarity_matrix
- Migration: Update parameter names in function calls
long_to_matrix was renamed to titers_list_to_matrix since it is specific to viral titer data processing.
Function process_antigenic_data accepts a data frame as input, instead of the previous form of a file path.
In process_antigenic_data, is_titer became is_similarity for clearity for broader audience. Parameter id_prefix was removed.

New Features

Added euclidean_embedding() function with enhanced performance and features:
- Matrix reordering: Automatic spectral ordering concentrates largest dissimilarity values in corners for improved optimization
- Enhanced validation: Better input data quality checks with informative warnings
- Improved documentation: More detailed examples and parameter guidance

Improvements

Package dependencies where reduced from 20 to 13
Enhanced algorithm documentation with clearer physics-inspired approach description
Better handling of edge cases in dissimilarity matrix processing
Improved error messages for parameter validation
Updated parameter_sensitivity function to use modern ggplot2 syntax
Improved input validation and error handling in sensitivity analysis
Enhanced MLE calculation algorithm for better robustness
Replaced deprecated size parameter with linewidth in plots
Enhanced input validation and error messages in create_cv_folds()
input_dissimilarities parameter now optional in error_calculator_comparison()
initial_parameter_optimization saves/returns the parameters in log scale, consistent with other function
A vignette was added

Deprecation Timeline

Version 2.0.0: create_topolow_map() deprecated, issues warning
Version 3.0.0 (planned): create_topolow_map() will be removed

Migration Guide

To update your code:

# Old (deprecated):
result <- create_topolow_map(distance_matrix = my_matrix, 
  # ... other parameters
)

# New (recommended):
result <- euclidean_embedding(dissimilarity_matrix = my_matrix,  # parameter name changed
  # ... other parameters (unchanged)
)

topolow 1.0.0 (2025-07-11)

All exported methods now include \value documentation describing the output's class, structure, and meaning.
Examples for unexported functions have been omitted, and \dontrun{} wrappers have been removed5. Slower examples are now wrapped in \donttest{} as appropriate.
Functions no longer write to user directories by default. Functions where writing a file is the main purpose now require the user to specify an output directory.
The complex distributed processing functionality has been removed, as it was not essential for typical use cases.
The link to our paper and citation information have been updated.

topolow 0.3.2

Initial release to CRAN (revised per CRAN reviewr's instructions).
Introduces the Topolow algorithm, a physics-inspired method for antigenic cartography.
Provides robust mapping and complete positioning of all antigens, even with highly sparse datasets (>95% missing values).
Implements automatic, likelihood-based estimation to determine the optimal dimensionality of the antigenic map.
Includes functionality to calculate "antigenic velocity" vectors to quantify the rate and direction of antigenic drift.
Features tools for handling and processing cross-reactivity and binding affinity assay data, including those with thresholded values.
Demonstrates improved prediction accuracy and run-to-run stability compared to traditional MDS methods.