88 Commits

Author SHA1 Message Date
Lorenzo Mec-iS
a62c293244 Add another pairwise distance algorithm 2025-01-28 00:30:57 +00:00
Lorenzo Mec-iS
39f87aa5c2 add tests to fastpair 2025-01-28 00:20:29 +00:00
Lorenzo Mec-iS
8cc02cdd48 fix test 2025-01-27 23:43:42 +00:00
Lorenzo Mec-iS
d60ba63862 Merge branch 'main' of github.com:smartcorelib/smartcore into march-2023-improvements 2025-01-27 23:34:45 +00:00
Lorenzo
5dd5c2f0d0 Merge branch 'development' into march-2023-improvements 2025-01-27 23:28:58 +00:00
Lorenzo
c8ec8fec00 Fix #245: return error for NaN in naive bayes (#246)
* Fix #245: return error for NaN in naive bayes
* Implement error handling for NaN values in NBayes predict:
* general behaviour has been kept unchanged according to original tests in `mod.rs`
* aka: error is returned only if all the predicted probabilities are NaN
* Add tests
* Add test with static values
* Add test for numerical stability with numpy
2025-01-27 23:17:55 +00:00
Lorenzo
3da433f757 Implement predict_proba for DecisionTreeClassifier (#287)
* Implement predict_proba for DecisionTreeClassifier
* Some automated fixes suggested by cargo clippy --fix
2025-01-20 18:50:00 +00:00
Lorenzo (Mec-iS)
074cfaf14f rustfmt 2023-03-24 12:06:54 +09:00
Lorenzo
393cf15534 Merge branch 'development' into march-2023-improvements 2023-03-24 12:05:06 +09:00
Lorenzo (Mec-iS)
80c406b37d Merge branch 'development' of github.com:smartcorelib/smartcore into march-2023-improvements 2023-03-21 17:38:35 +09:00
Lorenzo (Mec-iS)
0e1bf6ce7f Add ordered_pairs method to FastPair 2023-03-21 14:46:33 +09:00
Lorenzo (Mec-iS)
0c9c70f8d2 Merge 2022-11-09 12:05:17 +00:00
morenol
62de25b2ae Handle kernel serialization (#232)
* Handle kernel serialization
* Do not use typetag in WASM
* enable tests for serialization
* Update serde feature deps

Co-authored-by: Luis Moreno <morenol@users.noreply.github.com>
Co-authored-by: Lorenzo <tunedconsulting@gmail.com>
2022-11-08 11:29:56 -05:00
morenol
7d87451333 Fixes for release (#237)
* Fixes for release
* add new test
* Remove change applied in development branch
* Only add dependency for wasm32
* Update ci.yml

Co-authored-by: Luis Moreno <morenol@users.noreply.github.com>
Co-authored-by: Lorenzo <tunedconsulting@gmail.com>
2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
265fd558e7 make work cargo build --target wasm32-unknown-unknown 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
e25e2aea2b update CHANGELOG 2022-11-08 11:29:56 -05:00
Lorenzo
2f6dd1325e update comment 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
b0dece9476 use getrandom/js 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
c507d976be Update CHANGELOG 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
fa54d5ee86 Remove unused tests flags 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
459d558d48 minor fixes to doc 2022-11-08 11:29:56 -05:00
Lorenzo
1b7dda30a2 minor fix 2022-11-08 11:29:56 -05:00
Lorenzo
c1bd1df5f6 minor fix 2022-11-08 11:29:56 -05:00
Lorenzo
cf751f05aa minor fix 2022-11-08 11:29:56 -05:00
Lorenzo
63ed89aadd minor fix 2022-11-08 11:29:56 -05:00
Lorenzo
890e9d644c minor fix 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
af0a740394 Fix std_rand feature 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
616e38c282 cleanup 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
a449fdd4ea fmt 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
669f87f812 Use getrandom as default (for no-std feature) 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
6d529b34d2 Add static analyzer to doc 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
3ec9e4f0db Exclude datasets test for wasm/wasi 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
527477dea7 minor fixes 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
5b517c5048 minor fix 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
2df0795be9 Release 0.3 2022-11-08 11:29:56 -05:00
Lorenzo
0dc97a4e9b Create DEVELOPERS.md 2022-11-08 11:29:56 -05:00
Lorenzo
6c0fd37222 Update README.md 2022-11-08 11:29:56 -05:00
Lorenzo
d8d0fb6903 Update README.md 2022-11-08 11:29:56 -05:00
morenol
8d07efd921 Use Box in SVM and remove lifetimes (#228)
* Do not change external API
Authored-by: Luis Moreno <morenol@users.noreply.github.com>
2022-11-08 11:29:56 -05:00
morenol
ba27dd2a55 Fix CI (#227)
* Update ci.yml
Co-authored-by: Luis Moreno <morenol@users.noreply.github.com>
2022-11-08 11:29:56 -05:00
Lorenzo
ed9769f651 Implement CSV reader with new traits (#209) 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
b427e5d8b1 Improve options conditionals 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
fabe362755 Implement Display for NaiveBayes 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
ee6b6a53d6 cargo clippy 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
19f3a2fcc0 Fix signature of metrics tests 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
e09c4ba724 Add kernels' parameters to public interface 2022-11-08 11:29:56 -05:00
Lorenzo
6624732a65 Fix svr tests (#222) 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
1cbde3ba22 Refactor modules structure in src/svm 2022-11-08 11:29:56 -05:00
Lorenzo (Mec-iS)
551a6e34a5 clean up svm 2022-11-08 11:29:56 -05:00
Lorenzo
c45bab491a Support Wasi as target (#216)
* Improve features
* Add wasm32-wasi as a target
* Update .github/workflows/ci.yml
Co-authored-by: morenol <22335041+morenol@users.noreply.github.com>
2022-11-08 11:29:56 -05:00
Lorenzo
7f35dc54e4 Disambiguate distances. Implement Fastpair. (#220) 2022-11-08 11:29:56 -05:00
morenol
8f1a7dfd79 build: fix compilation without default features (#218)
* build: fix compilation with optional features
* Remove unused config from Cargo.toml
* Fix cache keys
Co-authored-by: Luis Moreno <morenol@users.noreply.github.com>
2022-11-08 11:29:56 -05:00
Lorenzo
712c478af6 Improve features (#215) 2022-11-08 11:29:56 -05:00
Lorenzo
4d36b7f34f Fix metrics::auc (#212)
* Fix metrics::auc
2022-11-08 11:29:56 -05:00
Lorenzo
a16927aa16 Port ensemble. Add Display to naive_bayes (#208) 2022-11-08 11:29:56 -05:00
Lorenzo
d91f4f7ce4 Update README.md 2022-11-08 11:29:56 -05:00
Lorenzo
a7fa0585eb Merge potential next release v0.4 (#187) Breaking Changes
* First draft of the new n-dimensional arrays + NB use case
* Improves default implementation of multiple Array methods
* Refactors tree methods
* Adds matrix decomposition routines
* Adds matrix decomposition methods to ndarray and nalgebra bindings
* Refactoring + linear regression now uses array2
* Ridge & Linear regression
* LBFGS optimizer & logistic regression
* LBFGS optimizer & logistic regression
* Changes linear methods, metrics and model selection methods to new n-dimensional arrays
* Switches KNN and clustering algorithms to new n-d array layer
* Refactors distance metrics
* Optimizes knn and clustering methods
* Refactors metrics module
* Switches decomposition methods to n-dimensional arrays
* Linalg refactoring - cleanup rng merge (#172)
* Remove legacy DenseMatrix and BaseMatrix implementation. Port the new Number, FloatNumber and Array implementation into module structure.
* Exclude AUC metrics. Needs reimplementation
* Improve developers walkthrough

New traits system in place at `src/numbers` and `src/linalg`
Co-authored-by: Lorenzo <tunedconsulting@gmail.com>

* Provide SupervisedEstimator with a constructor to avoid explicit dynamical box allocation in 'cross_validate' and 'cross_validate_predict' as required by the use of 'dyn' as per Rust 2021
* Implement getters to use as_ref() in src/neighbors
* Implement getters to use as_ref() in src/naive_bayes
* Implement getters to use as_ref() in src/linear
* Add Clone to src/naive_bayes
* Change signature for cross_validate and other model_selection functions to abide to use of dyn in Rust 2021
* Implement ndarray-bindings. Remove FloatNumber from implementations
* Drop nalgebra-bindings support (as decided in conf-call to go for ndarray)
* Remove benches. Benches will have their own repo at smartcore-benches
* Implement SVC
* Implement SVC serialization. Move search parameters in dedicated module
* Implement SVR. Definitely too slow
* Fix compilation issues for wasm (#202)

Co-authored-by: Luis Moreno <morenol@users.noreply.github.com>
* Fix tests (#203)

* Port linalg/traits/stats.rs
* Improve methods naming
* Improve Display for DenseMatrix

Co-authored-by: Montana Low <montanalow@users.noreply.github.com>
Co-authored-by: VolodymyrOrlov <volodymyr.orlov@gmail.com>
2022-11-08 11:29:56 -05:00
RJ Nowling
a32eb66a6a Dataset doc cleanup (#205)
* Update iris.rs

* Update mod.rs

* Update digits.rs
2022-11-08 11:29:56 -05:00
Lorenzo
f605f6e075 Update README.md 2022-11-08 11:29:56 -05:00
Lorenzo
3b1aaaadf7 Update README.md 2022-11-08 11:29:56 -05:00
Lorenzo
d015b12402 Update CONTRIBUTING.md 2022-11-08 11:29:56 -05:00
morenol
d5200074c2 fix: fix issue with iterator for svc search (#182) 2022-11-08 11:29:56 -05:00
morenol
473cdfc44d refactor: Try to follow similar pattern to other APIs (#180)
Co-authored-by: Luis Moreno <morenol@users.noreply.github.com>
2022-11-08 11:29:56 -05:00
morenol
ad2e6c2900 feat: expose hyper tuning module in model_selection (#179)
* feat: expose hyper tuning module in model_selection

* Move to a folder

Co-authored-by: Luis Moreno <morenol@users.noreply.github.com>
2022-11-08 11:29:56 -05:00
Lorenzo
9ea3133c27 Update CONTRIBUTING.md 2022-11-08 11:29:56 -05:00
Lorenzo
e4c47c7540 Add contribution guidelines (#178) 2022-11-08 11:29:56 -05:00
Montana Low
f4fd4d2239 make default params available to serde (#167)
* add seed param to search params

* make default params available to serde

* lints

* create defaults for enums

* lint
2022-11-08 11:29:56 -05:00
Montana Low
05dfffad5c add seed param to search params (#168) 2022-11-08 11:29:56 -05:00
morenol
a37b552a7d Lmm/add seeds in more algorithms (#164)
* Provide better output in flaky tests

* feat: add seed parameter to multiple algorithms

* Update changelog

Co-authored-by: Luis Moreno <morenol@users.noreply.github.com>
2022-11-08 11:29:56 -05:00
Montana Low
55e1158581 Complete grid search params (#166)
* grid search draft

* hyperparam search for linear estimators

* grid search for ensembles

* support grid search for more algos

* grid search for unsupervised algos

* minor cleanup
2022-11-08 11:29:56 -05:00
morenol
cfa824d7db Provide better output in flaky tests (#163) 2022-11-08 11:29:56 -05:00
morenol
bb5b437a32 feat: allocate first and then proceed to create matrix from Vec of Ro… (#159)
* feat: allocate first and then proceed to create matrix from Vec of RowVectors
2022-11-08 11:29:56 -05:00
morenol
851533dfa7 Make rand_distr optional (#161) 2022-11-08 11:29:56 -05:00
Lorenzo
0d996edafe Update LICENSE 2022-11-08 11:29:56 -05:00
morenol
f291b71f4a fix: fix compilation warnings when running only with default features (#160)
* fix: fix compilation warnings when running only with default features
Co-authored-by: Luis Moreno <morenol@users.noreply.github.com>
2022-11-08 11:29:56 -05:00
Tim Toebrock
2d75c2c405 Implement a generic read_csv method (#147)
* feat: Add interface to build `Matrix` from rows.
* feat: Add option to derive `RealNumber` from string.
To construct a `Matrix` from csv, and therefore from string, I need to be able to deserialize a generic `RealNumber` from string.
* feat: Implement `Matrix::read_csv`.
2022-11-08 11:29:56 -05:00
Montana Low
1f2597be74 grid search (#154)
* grid search draft
* hyperparam search for linear estimators
2022-11-08 11:29:56 -05:00
Montana Low
0f442e96c0 Handle multiclass precision/recall (#152)
* handle multiclass precision/recall
2022-11-08 11:29:56 -05:00
dependabot[bot]
44e4be23a6 Update criterion requirement from 0.3 to 0.4 (#150)
* Update criterion requirement from 0.3 to 0.4

Updates the requirements on [criterion](https://github.com/bheisler/criterion.rs) to permit the latest version.
- [Release notes](https://github.com/bheisler/criterion.rs/releases)
- [Changelog](https://github.com/bheisler/criterion.rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/bheisler/criterion.rs/compare/0.3.0...0.4.0)

---
updated-dependencies:
- dependency-name: criterion
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix criterion

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Luis Moreno <morenol@users.noreply.github.com>
2022-11-08 11:29:56 -05:00
Christos Katsakioris
01f753f86d Add serde for StandardScaler (#148)
* Derive `serde::Serialize` and `serde::Deserialize` for
  `StandardScaler`.
* Add relevant unit test.

Signed-off-by: Christos Katsakioris <ckatsak@gmail.com>

Signed-off-by: Christos Katsakioris <ckatsak@gmail.com>
2022-11-08 11:29:56 -05:00
Tim Toebrock
df766eaf79 Implementation of Standard scaler (#143)
* docs: Fix typo in doc for categorical transformer.
* feat: Add option to take a column from Matrix.
I created the method `Matrix::take_column` that uses the `Matrix::take`-interface to extract a single column from a matrix. I need that feature in the implementation of  `StandardScaler`.
* feat: Add `StandardScaler`.
Authored-by: titoeb <timtoebrock@googlemail.com>
2022-11-08 11:29:56 -05:00
Lorenzo
09d9205696 Add example for FastPair (#144)
* Add example

* Move to top

* Add imports to example

* Fix imports
2022-11-08 11:29:56 -05:00
Lorenzo
dc7f01db4a Implement fastpair (#142)
* initial fastpair implementation
* FastPair initial implementation
* implement fastpair
* Add random test
* Add bench for fastpair
* Refactor with constructor for FastPair
* Add serialization for PairwiseDistance
* Add fp_bench feature for fastpair bench
2022-11-08 11:29:56 -05:00
Chris McComb
eb4b49d552 Added additional doctest and fixed indices (#141) 2022-11-08 11:29:56 -05:00
morenol
98e3465e7b Fix clippy warnings (#139)
Co-authored-by: Luis Moreno <morenol@users.noreply.github.com>
2022-11-08 11:29:56 -05:00
ferrouille
ea39024fd2 Add SVC::decision_function (#135) 2022-11-08 11:29:56 -05:00
dependabot[bot]
4e94feb872 Update nalgebra requirement from 0.23.0 to 0.31.0 (#128)
Updates the requirements on [nalgebra](https://github.com/dimforge/nalgebra) to permit the latest version.
- [Release notes](https://github.com/dimforge/nalgebra/releases)
- [Changelog](https://github.com/dimforge/nalgebra/blob/dev/CHANGELOG.md)
- [Commits](https://github.com/dimforge/nalgebra/compare/v0.23.0...v0.31.0)

---
updated-dependencies:
- dependency-name: nalgebra
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-11-08 11:29:56 -05:00
dependabot-preview[bot]
fa802d2d3f build(deps): update nalgebra requirement from 0.23.0 to 0.26.2 (#98)
* build(deps): update nalgebra requirement from 0.23.0 to 0.26.2

Updates the requirements on [nalgebra](https://github.com/dimforge/nalgebra) to permit the latest version.
- [Release notes](https://github.com/dimforge/nalgebra/releases)
- [Changelog](https://github.com/dimforge/nalgebra/blob/dev/CHANGELOG.md)
- [Commits](https://github.com/dimforge/nalgebra/compare/v0.23.0...v0.26.2)

Signed-off-by: dependabot-preview[bot] <support@dependabot.com>

* fix: updates for nalgebre

* test: explicitly call pow_mut from BaseVector since now it conflicts with nalgebra implementation

* Don't be strict with dependencies

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>
Co-authored-by: Luis Moreno <morenol@users.noreply.github.com>
2022-11-08 11:29:56 -05:00
16 changed files with 976 additions and 100 deletions
+219
@@ -0,0 +1,219 @@
//! This module provides FastPair, a data-structure for efficiently tracking the dynamic
//! closest pairs in a set of points, with an example usage in hierarchical clustering.[2][3][5]
//!
//! ## Purpose
//!
//! FastPair allows quick retrieval of the nearest neighbor for each data point by maintaining
//! a "conga line" of closest pairs. Each point retains a link to its known nearest neighbor,
//! and updates in the data structure propagate accordingly. This can be leveraged in
//! agglomerative clustering steps, where merging or insertion of new points must be reflected
//! in nearest-neighbor relationships.
//!
//! ## Example
//!
//! ```
//! use smartcore::metrics::distance::PairwiseDistance;
//! use smartcore::linalg::basic::matrix::DenseMatrix;
//! use smartcore::algorithm::neighbour::fastpair::FastPair;
//!
//! let x = DenseMatrix::from_2d_array(&[
//! &[5.1, 3.5, 1.4, 0.2],
//! &[4.9, 3.0, 1.4, 0.2],
//! &[4.7, 3.2, 1.3, 0.2],
//! &[4.6, 3.1, 1.5, 0.2],
//! &[5.0, 3.6, 1.4, 0.2],
//! &[5.4, 3.9, 1.7, 0.4],
//! ]).unwrap();
//!
//! let fastpair = FastPair::new(&x).unwrap();
//! let closest = fastpair.closest_pair();
//! println!("Closest pair: {:?}", closest);
//! ```
use std::collections::HashMap;
use num::Bounded;
use crate::error::{Failed, FailedError};
use crate::linalg::basic::arrays::{Array, Array1, Array2};
use crate::metrics::distance::euclidian::Euclidian;
use crate::metrics::distance::PairwiseDistance;
use crate::numbers::floatnum::FloatNumber;
use crate::numbers::realnum::RealNumber;
/// Eppstein dynamic closest-pair structure
/// `M` can be any matrix-like type that provides row access
#[derive(Debug)]
pub struct EppsteinDCP<'a, T: RealNumber + FloatNumber, M: Array2<T>> {
samples: &'a M,
// "buckets" store, for each row, a small structure recording potential neighbors
neighbors: HashMap<usize, PairwiseDistance<T>>,
}
impl<'a, T: RealNumber + FloatNumber, M: Array2<T>> EppsteinDCP<'a, T, M> {
/// Creates a new EppsteinDCP instance with the given data
pub fn new(m: &'a M) -> Result<Self, Failed> {
if m.shape().0 < 3 {
return Err(Failed::because(
FailedError::FindFailed,
"min number of rows should be 3",
));
}
let mut this = Self {
samples: m,
neighbors: HashMap::with_capacity(m.shape().0),
};
this.initialize();
Ok(this)
}
/// Build an initial "conga line" or chain of potential neighbors
/// akin to Eppstein's technique[2].
fn initialize(&mut self) {
let n = self.samples.shape().0;
if n < 2 {
return;
}
// Assign each row i some large distance by default
for i in 0..n {
self.neighbors.insert(
i,
PairwiseDistance {
node: i,
neighbour: None,
distance: Some(<T as Bounded>::max_value()),
},
);
}
// Example: link each i to the next, forming a chain
// (depending on the actual Eppstein approach, can refine)
for i in 0..(n - 1) {
let dist = self.compute_dist(i, i + 1);
self.neighbors.entry(i).and_modify(|pd| {
pd.neighbour = Some(i + 1);
pd.distance = Some(dist);
});
}
// Potential refinement steps omitted for brevity
}
/// Insert a point into the structure.
pub fn insert(&mut self, row_idx: usize) {
// Expand data, find neighbor to link with
// For example, link row_idx to nearest among existing
let mut best_neighbor = None;
let mut best_d = <T as Bounded>::max_value();
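for i in self.neighbors.keys() {
// Skip the point itself: its distance to itself is 0, so it would
// otherwise always be chosen as its own nearest neighbour
if *i == row_idx {
continue;
}
let d = self.compute_dist(*i, row_idx);
if d < best_d {
best_d = d;
best_neighbor = Some(*i);
}
}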
self.neighbors.insert(
row_idx,
PairwiseDistance {
node: row_idx,
neighbour: best_neighbor,
distance: Some(best_d),
},
);
// For the best_neighbor, you might want to see if row_idx becomes closer
if let Some(kn) = best_neighbor {
let dist = self.compute_dist(row_idx, kn);
let entry = self.neighbors.get_mut(&kn).unwrap();
if dist < entry.distance.unwrap() {
entry.neighbour = Some(row_idx);
entry.distance = Some(dist);
}
}
}
/// For hierarchical clustering, discover minimal pairs, then merge
pub fn closest_pair(&self) -> Option<PairwiseDistance<T>> {
let mut min_pair: Option<PairwiseDistance<T>> = None;
for (_, pd) in &self.neighbors {
if let Some(d) = pd.distance {
if min_pair.is_none() || d < min_pair.as_ref().unwrap().distance.unwrap() {
min_pair = Some(pd.clone());
}
}
}
min_pair
}
fn compute_dist(&self, i: usize, j: usize) -> T {
// Squared Euclidean distance: sum of squared differences, no square root taken
let row_i = self.samples.get_row(i);
let row_j = self.samples.get_row(j);
row_i
.iterator(0)
.zip(row_j.iterator(0))
.map(|(a, b)| (*a - *b) * (*a - *b))
.sum()
}
}
/// Simple usage
#[cfg(test)]
mod tests_eppstein {
use super::*;
use crate::linalg::basic::matrix::DenseMatrix;
#[test]
fn test_eppstein() {
let matrix =
DenseMatrix::from_2d_array(&[&vec![1.0, 2.0], &vec![2.0, 2.0], &vec![5.0, 3.0]])
.unwrap();
let mut dcp = EppsteinDCP::new(&matrix).unwrap();
dcp.insert(2);
let cp = dcp.closest_pair();
assert!(cp.is_some());
}
#[test]
fn compare_fastpair_eppstein() {
use crate::algorithm::neighbour::fastpair::FastPair;
// Assuming EppsteinDCP is implemented in a similar module
use crate::algorithm::neighbour::eppstein::EppsteinDCP;
// Create a static example matrix
let x = DenseMatrix::from_2d_array(&[
&[5.1, 3.5, 1.4, 0.2],
&[4.9, 3.0, 1.4, 0.2],
&[4.7, 3.2, 1.3, 0.2],
&[4.6, 3.1, 1.5, 0.2],
&[5.0, 3.6, 1.4, 0.2],
&[5.4, 3.9, 1.7, 0.4],
&[4.6, 3.4, 1.4, 0.3],
&[5.0, 3.4, 1.5, 0.2],
&[4.4, 2.9, 1.4, 0.2],
&[4.9, 3.1, 1.5, 0.1],
])
.unwrap();
// Build FastPair
let fastpair = FastPair::new(&x).unwrap();
let pair_fastpair = fastpair.closest_pair();
// Build EppsteinDCP
let eppstein = EppsteinDCP::new(&x).unwrap();
let pair_eppstein = eppstein.closest_pair();
// Compare the results
assert_eq!(pair_fastpair.node, pair_eppstein.as_ref().unwrap().node);
assert_eq!(
pair_fastpair.neighbour.unwrap(),
pair_eppstein.as_ref().unwrap().neighbour.unwrap()
);
// Use a small epsilon for floating-point comparison
let epsilon = 1e-9;
let diff: f64 =
pair_fastpair.distance.unwrap() - pair_eppstein.as_ref().unwrap().distance.unwrap();
assert!(diff.abs() < epsilon);
println!("FastPair result: {:?}", pair_fastpair);
println!("EppsteinDCP result: {:?}", pair_eppstein);
}
}
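The comparison test above checks that `FastPair` and `EppsteinDCP` agree with each other, but it has no independent baseline. A minimal brute-force reference could look like the sketch below. It assumes points are plain `Vec<f64>` rows rather than a smartcore matrix, and the function names are illustrative, not part of the crate. Like `compute_dist`, it uses the squared Euclidean distance.

```rust
/// Squared Euclidean distance between two equal-length points.
fn squared_distance(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b.iter()).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Brute-force closest pair over all `i < j`.
/// Returns `(i, j, squared_distance)`, or `None` for fewer than two points.
fn brute_force_closest_pair(points: &[Vec<f64>]) -> Option<(usize, usize, f64)> {
    let mut best: Option<(usize, usize, f64)> = None;
    for i in 0..points.len() {
        for j in (i + 1)..points.len() {
            let d = squared_distance(&points[i], &points[j]);
            // Keep the first pair seen, then only strictly closer pairs.
            if best.map_or(true, |(_, _, best_d)| d < best_d) {
                best = Some((i, j, d));
            }
        }
    }
    best
}
```

The check is quadratic in the number of points, so it is only suitable as a test oracle on small inputs.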
+117 -1
@@ -173,6 +173,21 @@ impl<'a, T: RealNumber + FloatNumber, M: Array2<T>> FastPair<'a, T, M> {
}
}
///
/// Return the dissimilarities ordered from closest to furthest
///
#[allow(dead_code)]
pub fn ordered_pairs(&self) -> std::vec::IntoIter<&PairwiseDistance<T>> {
// possible improvement: return `impl Iterator<Item = &PairwiseDistance<T>>`
// instead of exposing the concrete `std::vec::IntoIter` type
let mut distances = self
.distances
.values()
.collect::<Vec<&PairwiseDistance<T>>>();
distances.sort_by(|a, b| a.partial_cmp(b).unwrap());
distances.into_iter()
}
//
// Compute distances from input to all other points in data-structure.
// input is the row index of the sample matrix
@@ -212,7 +227,9 @@ mod tests_fastpair {
use crate::linalg::basic::{arrays::Array, matrix::DenseMatrix};
/// Brute force algorithm, used only for comparison and testing
pub fn closest_pair_brute(fastpair: &FastPair<f64, DenseMatrix<f64>>) -> PairwiseDistance<f64> {
pub fn closest_pair_brute(
fastpair: &FastPair<'_, f64, DenseMatrix<f64>>,
) -> PairwiseDistance<f64> {
use itertools::Itertools;
let m = fastpair.samples.shape().0;
@@ -586,4 +603,103 @@ mod tests_fastpair {
assert_eq!(closest, min_dissimilarity);
}
#[test]
fn fastpair_ordered_pairs() {
let x = DenseMatrix::<f64>::from_2d_array(&[
&[5.1, 3.5, 1.4, 0.2],
&[4.9, 3.0, 1.4, 0.2],
&[4.7, 3.2, 1.3, 0.2],
&[4.6, 3.1, 1.5, 0.2],
&[5.0, 3.6, 1.4, 0.2],
&[5.4, 3.9, 1.7, 0.4],
&[4.9, 3.1, 1.5, 0.1],
&[7.0, 3.2, 4.7, 1.4],
&[6.4, 3.2, 4.5, 1.5],
&[6.9, 3.1, 4.9, 1.5],
&[5.5, 2.3, 4.0, 1.3],
&[6.5, 2.8, 4.6, 1.5],
&[4.6, 3.4, 1.4, 0.3],
&[5.0, 3.4, 1.5, 0.2],
&[4.4, 2.9, 1.4, 0.2],
])
.unwrap();
let fastpair = FastPair::new(&x).unwrap();
let ordered = fastpair.ordered_pairs();
let mut previous: f64 = -1.0;
for p in ordered {
if previous == -1.0 {
previous = p.distance.unwrap();
} else {
let current = p.distance.unwrap();
assert!(current >= previous);
previous = current;
}
}
}
#[test]
fn test_empty_set() {
let empty_matrix = DenseMatrix::<f64>::zeros(0, 0);
let result = FastPair::new(&empty_matrix);
assert!(result.is_err());
if let Err(e) = result {
assert_eq!(
e,
Failed::because(FailedError::FindFailed, "min number of rows should be 3")
);
}
}
#[test]
fn test_single_point() {
let single_point = DenseMatrix::from_2d_array(&[&[1.0, 2.0, 3.0]]).unwrap();
let result = FastPair::new(&single_point);
assert!(result.is_err());
if let Err(e) = result {
assert_eq!(
e,
Failed::because(FailedError::FindFailed, "min number of rows should be 3")
);
}
}
#[test]
fn test_two_points() {
let two_points = DenseMatrix::from_2d_array(&[&[1.0, 2.0], &[3.0, 4.0]]).unwrap();
let result = FastPair::new(&two_points);
assert!(result.is_err());
if let Err(e) = result {
assert_eq!(
e,
Failed::because(FailedError::FindFailed, "min number of rows should be 3")
);
}
}
#[test]
fn test_three_identical_points() {
let identical_points =
DenseMatrix::from_2d_array(&[&[1.0, 1.0], &[1.0, 1.0], &[1.0, 1.0]]).unwrap();
let result = FastPair::new(&identical_points);
assert!(result.is_ok());
let fastpair = result.unwrap();
let closest_pair = fastpair.closest_pair();
assert_eq!(closest_pair.distance, Some(0.0));
}
#[test]
fn test_result_unwrapping() {
let valid_matrix =
DenseMatrix::from_2d_array(&[&[1.0, 2.0], &[3.0, 4.0], &[5.0, 6.0], &[7.0, 8.0]])
.unwrap();
let result = FastPair::new(&valid_matrix);
assert!(result.is_ok());
// This should not panic
let _fastpair = result.unwrap();
}
}
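The comment in `ordered_pairs` mentions returning `impl Iterator` instead of a concrete iterator type. A minimal sketch of that idea follows. It assumes a simplified `Pair` struct as a hypothetical stand-in for `PairwiseDistance<T>`, with `f64` distances. It also swaps `partial_cmp(..).unwrap()` for `f64::total_cmp`, so a NaN distance does not panic, and treats a missing distance as infinitely far; both are choices made for this sketch, not the crate's behaviour.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for smartcore's `PairwiseDistance<T>`.
#[derive(Debug, Clone, PartialEq)]
struct Pair {
    node: usize,
    neighbour: Option<usize>,
    distance: Option<f64>,
}

/// Yields the stored pairs from closest to furthest.
/// The return type hides the concrete `std::vec::IntoIter` behind `impl Iterator`.
fn ordered_pairs(distances: &HashMap<usize, Pair>) -> impl Iterator<Item = &Pair> {
    let mut pairs: Vec<&Pair> = distances.values().collect();
    // Pairs without a distance are treated as infinitely far (assumption of this sketch).
    pairs.sort_by(|a, b| {
        let da = a.distance.unwrap_or(f64::INFINITY);
        let db = b.distance.unwrap_or(f64::INFINITY);
        da.total_cmp(&db)
    });
    pairs.into_iter()
}
```

Callers can then chain adapters such as `.take(k)` without depending on how the ordering is materialised internally.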
+3 -1
@@ -41,7 +41,9 @@ use serde::{Deserialize, Serialize};
pub(crate) mod bbd_tree;
/// tree data structure for fast nearest neighbor search
pub mod cover_tree;
/// fastpair closest neighbour algorithm
/// eppstein pairwise closest neighbour algorithm
pub mod eppstein;
/// fastpair pairwise closest neighbour algorithm
pub mod fastpair;
/// very simple algorithm that sequentially checks each element of the list until a match is found or the whole list has been searched.
pub mod linear_search;
-1
@@ -7,7 +7,6 @@
clippy::approx_constant
)]
#![warn(missing_docs)]
#![warn(rustdoc::missing_doc_code_examples)]
//! # smartcore
//!
+12 -13
@@ -91,7 +91,7 @@ impl<'a, T: Debug + Display + Copy + Sized> DenseMatrixView<'a, T> {
}
}
impl<'a, T: Debug + Display + Copy + Sized> fmt::Display for DenseMatrixView<'a, T> {
impl<T: Debug + Display + Copy + Sized> fmt::Display for DenseMatrixView<'_, T> {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
writeln!(
f,
@@ -142,7 +142,7 @@ impl<'a, T: Debug + Display + Copy + Sized> DenseMatrixMutView<'a, T> {
}
}
fn iter_mut<'b>(&'b mut self, axis: u8) -> Box<dyn Iterator<Item = &mut T> + 'b> {
fn iter_mut<'b>(&'b mut self, axis: u8) -> Box<dyn Iterator<Item = &'b mut T> + 'b> {
let column_major = self.column_major;
let stride = self.stride;
let ptr = self.values.as_mut_ptr();
@@ -169,7 +169,7 @@ impl<'a, T: Debug + Display + Copy + Sized> DenseMatrixMutView<'a, T> {
}
}
impl<'a, T: Debug + Display + Copy + Sized> fmt::Display for DenseMatrixMutView<'a, T> {
impl<T: Debug + Display + Copy + Sized> fmt::Display for DenseMatrixMutView<'_, T> {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
writeln!(
f,
@@ -493,7 +493,7 @@ impl<T: Number + RealNumber> EVDDecomposable<T> for DenseMatrix<T> {}
impl<T: Number + RealNumber> LUDecomposable<T> for DenseMatrix<T> {}
impl<T: Number + RealNumber> SVDDecomposable<T> for DenseMatrix<T> {}
impl<'a, T: Debug + Display + Copy + Sized> Array<T, (usize, usize)> for DenseMatrixView<'a, T> {
impl<T: Debug + Display + Copy + Sized> Array<T, (usize, usize)> for DenseMatrixView<'_, T> {
fn get(&self, pos: (usize, usize)) -> &T {
if self.column_major {
&self.values[pos.0 + pos.1 * self.stride]
@@ -515,7 +515,7 @@ impl<'a, T: Debug + Display + Copy + Sized> Array<T, (usize, usize)> for DenseMa
}
}
impl<'a, T: Debug + Display + Copy + Sized> Array<T, usize> for DenseMatrixView<'a, T> {
impl<T: Debug + Display + Copy + Sized> Array<T, usize> for DenseMatrixView<'_, T> {
fn get(&self, i: usize) -> &T {
if self.nrows == 1 {
if self.column_major {
@@ -553,11 +553,11 @@ impl<'a, T: Debug + Display + Copy + Sized> Array<T, usize> for DenseMatrixView<
}
}
impl<'a, T: Debug + Display + Copy + Sized> ArrayView2<T> for DenseMatrixView<'a, T> {}
impl<T: Debug + Display + Copy + Sized> ArrayView2<T> for DenseMatrixView<'_, T> {}
impl<'a, T: Debug + Display + Copy + Sized> ArrayView1<T> for DenseMatrixView<'a, T> {}
impl<T: Debug + Display + Copy + Sized> ArrayView1<T> for DenseMatrixView<'_, T> {}
impl<'a, T: Debug + Display + Copy + Sized> Array<T, (usize, usize)> for DenseMatrixMutView<'a, T> {
impl<T: Debug + Display + Copy + Sized> Array<T, (usize, usize)> for DenseMatrixMutView<'_, T> {
fn get(&self, pos: (usize, usize)) -> &T {
if self.column_major {
&self.values[pos.0 + pos.1 * self.stride]
@@ -579,9 +579,7 @@ impl<'a, T: Debug + Display + Copy + Sized> Array<T, (usize, usize)> for DenseMa
}
}
impl<'a, T: Debug + Display + Copy + Sized> MutArray<T, (usize, usize)>
for DenseMatrixMutView<'a, T>
{
impl<T: Debug + Display + Copy + Sized> MutArray<T, (usize, usize)> for DenseMatrixMutView<'_, T> {
fn set(&mut self, pos: (usize, usize), x: T) {
if self.column_major {
self.values[pos.0 + pos.1 * self.stride] = x;
@@ -595,15 +593,16 @@ impl<'a, T: Debug + Display + Copy + Sized> MutArray<T, (usize, usize)>
}
}
impl<'a, T: Debug + Display + Copy + Sized> MutArrayView2<T> for DenseMatrixMutView<'a, T> {}
impl<T: Debug + Display + Copy + Sized> MutArrayView2<T> for DenseMatrixMutView<'_, T> {}
impl<'a, T: Debug + Display + Copy + Sized> ArrayView2<T> for DenseMatrixMutView<'a, T> {}
impl<T: Debug + Display + Copy + Sized> ArrayView2<T> for DenseMatrixMutView<'_, T> {}
impl<T: RealNumber> MatrixStats<T> for DenseMatrix<T> {}
impl<T: RealNumber> MatrixPreprocessing<T> for DenseMatrix<T> {}
#[cfg(test)]
#[warn(clippy::reversed_empty_ranges)]
mod tests {
use super::*;
use approx::relative_eq;
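Most hunks in this file replace a named lifetime on an `impl` block with the anonymous lifetime `'_`, which is presumably what a lint such as clippy's `needless_lifetimes` suggests. The `iter_mut` hunk goes the other way and names the lifetime of the yielded references. The sketch below shows both patterns on hypothetical `View` and `ViewMut` types; they are illustrations only, not the crate's `DenseMatrixView` or `DenseMatrixMutView`.

```rust
use std::fmt;

/// Hypothetical borrowed view over a slice.
struct View<'a, T> {
    values: &'a [T],
}

// The impl body never refers to the lifetime by name, so `'_` is enough;
// this is equivalent to `impl<'a, T: fmt::Display> fmt::Display for View<'a, T>`.
impl<T: fmt::Display> fmt::Display for View<'_, T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for v in self.values {
            write!(f, "{} ", v)?;
        }
        Ok(())
    }
}

/// Hypothetical mutable view over a slice.
struct ViewMut<'a, T> {
    values: &'a mut [T],
}

impl<T> ViewMut<'_, T> {
    /// Naming `'b` on the item type ties each yielded `&mut T`
    /// to the mutable borrow of `self`.
    fn iter_mut<'b>(&'b mut self) -> Box<dyn Iterator<Item = &'b mut T> + 'b> {
        Box::new(self.values.iter_mut())
    }
}
```

Neither change alters behaviour; the first only removes a name that was never used, and the second spells out a lifetime that was previously left implicit.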
+6 -6
@@ -119,7 +119,7 @@ impl<T: Debug + Display + Copy + Sized> Array1<T> for Vec<T> {
}
}
impl<'a, T: Debug + Display + Copy + Sized> Array<T, usize> for VecMutView<'a, T> {
impl<T: Debug + Display + Copy + Sized> Array<T, usize> for VecMutView<'_, T> {
fn get(&self, i: usize) -> &T {
&self.ptr[i]
}
@@ -138,7 +138,7 @@ impl<'a, T: Debug + Display + Copy + Sized> Array<T, usize> for VecMutView<'a, T
}
}
impl<'a, T: Debug + Display + Copy + Sized> MutArray<T, usize> for VecMutView<'a, T> {
impl<T: Debug + Display + Copy + Sized> MutArray<T, usize> for VecMutView<'_, T> {
fn set(&mut self, i: usize, x: T) {
self.ptr[i] = x;
}
@@ -149,10 +149,10 @@ impl<'a, T: Debug + Display + Copy + Sized> MutArray<T, usize> for VecMutView<'a
}
}
impl<'a, T: Debug + Display + Copy + Sized> ArrayView1<T> for VecMutView<'a, T> {}
impl<'a, T: Debug + Display + Copy + Sized> MutArrayView1<T> for VecMutView<'a, T> {}
impl<T: Debug + Display + Copy + Sized> ArrayView1<T> for VecMutView<'_, T> {}
impl<T: Debug + Display + Copy + Sized> MutArrayView1<T> for VecMutView<'_, T> {}
impl<'a, T: Debug + Display + Copy + Sized> Array<T, usize> for VecView<'a, T> {
impl<T: Debug + Display + Copy + Sized> Array<T, usize> for VecView<'_, T> {
fn get(&self, i: usize) -> &T {
&self.ptr[i]
}
@@ -171,7 +171,7 @@ impl<'a, T: Debug + Display + Copy + Sized> Array<T, usize> for VecView<'a, T> {
}
}
impl<'a, T: Debug + Display + Copy + Sized> ArrayView1<T> for VecView<'a, T> {}
impl<T: Debug + Display + Copy + Sized> ArrayView1<T> for VecView<'_, T> {}
#[cfg(test)]
mod tests {
+6 -10
@@ -68,7 +68,7 @@ impl<T: Debug + Display + Copy + Sized> ArrayView2<T> for ArrayBase<OwnedRepr<T>
impl<T: Debug + Display + Copy + Sized> MutArrayView2<T> for ArrayBase<OwnedRepr<T>, Ix2> {}
impl<'a, T: Debug + Display + Copy + Sized> BaseArray<T, (usize, usize)> for ArrayView<'a, T, Ix2> {
impl<T: Debug + Display + Copy + Sized> BaseArray<T, (usize, usize)> for ArrayView<'_, T, Ix2> {
fn get(&self, pos: (usize, usize)) -> &T {
&self[[pos.0, pos.1]]
}
@@ -144,11 +144,9 @@ impl<T: Number + RealNumber> EVDDecomposable<T> for ArrayBase<OwnedRepr<T>, Ix2>
impl<T: Number + RealNumber> LUDecomposable<T> for ArrayBase<OwnedRepr<T>, Ix2> {}
impl<T: Number + RealNumber> SVDDecomposable<T> for ArrayBase<OwnedRepr<T>, Ix2> {}
impl<'a, T: Debug + Display + Copy + Sized> ArrayView2<T> for ArrayView<'a, T, Ix2> {}
impl<T: Debug + Display + Copy + Sized> ArrayView2<T> for ArrayView<'_, T, Ix2> {}
impl<'a, T: Debug + Display + Copy + Sized> BaseArray<T, (usize, usize)>
for ArrayViewMut<'a, T, Ix2>
{
impl<T: Debug + Display + Copy + Sized> BaseArray<T, (usize, usize)> for ArrayViewMut<'_, T, Ix2> {
fn get(&self, pos: (usize, usize)) -> &T {
&self[[pos.0, pos.1]]
}
@@ -175,9 +173,7 @@ impl<'a, T: Debug + Display + Copy + Sized> BaseArray<T, (usize, usize)>
}
}
impl<'a, T: Debug + Display + Copy + Sized> MutArray<T, (usize, usize)>
for ArrayViewMut<'a, T, Ix2>
{
impl<T: Debug + Display + Copy + Sized> MutArray<T, (usize, usize)> for ArrayViewMut<'_, T, Ix2> {
fn set(&mut self, pos: (usize, usize), x: T) {
self[[pos.0, pos.1]] = x
}
@@ -195,9 +191,9 @@ impl<'a, T: Debug + Display + Copy + Sized> MutArray<T, (usize, usize)>
}
}
impl<'a, T: Debug + Display + Copy + Sized> MutArrayView2<T> for ArrayViewMut<'a, T, Ix2> {}
impl<T: Debug + Display + Copy + Sized> MutArrayView2<T> for ArrayViewMut<'_, T, Ix2> {}
impl<'a, T: Debug + Display + Copy + Sized> ArrayView2<T> for ArrayViewMut<'a, T, Ix2> {}
impl<T: Debug + Display + Copy + Sized> ArrayView2<T> for ArrayViewMut<'_, T, Ix2> {}
#[cfg(test)]
mod tests {
+6 -6
@@ -41,7 +41,7 @@ impl<T: Debug + Display + Copy + Sized> ArrayView1<T> for ArrayBase<OwnedRepr<T>
impl<T: Debug + Display + Copy + Sized> MutArrayView1<T> for ArrayBase<OwnedRepr<T>, Ix1> {}
impl<'a, T: Debug + Display + Copy + Sized> BaseArray<T, usize> for ArrayView<'a, T, Ix1> {
impl<T: Debug + Display + Copy + Sized> BaseArray<T, usize> for ArrayView<'_, T, Ix1> {
fn get(&self, i: usize) -> &T {
&self[i]
}
@@ -60,9 +60,9 @@ impl<'a, T: Debug + Display + Copy + Sized> BaseArray<T, usize> for ArrayView<'a
}
}
impl<'a, T: Debug + Display + Copy + Sized> ArrayView1<T> for ArrayView<'a, T, Ix1> {}
impl<T: Debug + Display + Copy + Sized> ArrayView1<T> for ArrayView<'_, T, Ix1> {}
impl<'a, T: Debug + Display + Copy + Sized> BaseArray<T, usize> for ArrayViewMut<'a, T, Ix1> {
impl<T: Debug + Display + Copy + Sized> BaseArray<T, usize> for ArrayViewMut<'_, T, Ix1> {
fn get(&self, i: usize) -> &T {
&self[i]
}
@@ -81,7 +81,7 @@ impl<'a, T: Debug + Display + Copy + Sized> BaseArray<T, usize> for ArrayViewMut
}
}
impl<'a, T: Debug + Display + Copy + Sized> MutArray<T, usize> for ArrayViewMut<'a, T, Ix1> {
impl<T: Debug + Display + Copy + Sized> MutArray<T, usize> for ArrayViewMut<'_, T, Ix1> {
fn set(&mut self, i: usize, x: T) {
self[i] = x;
}
@@ -92,8 +92,8 @@ impl<'a, T: Debug + Display + Copy + Sized> MutArray<T, usize> for ArrayViewMut<
}
}
impl<'a, T: Debug + Display + Copy + Sized> ArrayView1<T> for ArrayViewMut<'a, T, Ix1> {}
impl<'a, T: Debug + Display + Copy + Sized> MutArrayView1<T> for ArrayViewMut<'a, T, Ix1> {}
impl<T: Debug + Display + Copy + Sized> ArrayView1<T> for ArrayViewMut<'_, T, Ix1> {}
impl<T: Debug + Display + Copy + Sized> MutArrayView1<T> for ArrayViewMut<'_, T, Ix1> {}
impl<T: Debug + Display + Copy + Sized> Array1<T> for ArrayBase<OwnedRepr<T>, Ix1> {
fn slice<'a>(&'a self, range: Range<usize>) -> Box<dyn ArrayView1<T> + 'a> {
-1
@@ -142,7 +142,6 @@ pub trait MatrixPreprocessing<T: RealNumber>: MutArrayView2<T> + Clone {
///
/// assert_eq!(a, expected);
/// ```
fn binarize_mut(&mut self, threshold: T) {
let (nrows, ncols) = self.shape();
for row in 0..nrows {
+4 -4
@@ -258,8 +258,8 @@ impl<TX: Number + FloatNumber + RealNumber, TY: Number + Ord, X: Array2<TX>, Y:
}
}
impl<'a, T: Number + FloatNumber, X: Array2<T>> ObjectiveFunction<T, X>
for BinaryObjectiveFunction<'a, T, X>
impl<T: Number + FloatNumber, X: Array2<T>> ObjectiveFunction<T, X>
for BinaryObjectiveFunction<'_, T, X>
{
fn f(&self, w_bias: &[T]) -> T {
let mut f = T::zero();
@@ -313,8 +313,8 @@ struct MultiClassObjectiveFunction<'a, T: Number + FloatNumber, X: Array2<T>> {
_phantom_t: PhantomData<T>,
}
impl<'a, T: Number + FloatNumber + RealNumber, X: Array2<T>> ObjectiveFunction<T, X>
for MultiClassObjectiveFunction<'a, T, X>
impl<T: Number + FloatNumber + RealNumber, X: Array2<T>> ObjectiveFunction<T, X>
for MultiClassObjectiveFunction<'_, T, X>
{
fn f(&self, w_bias: &[T]) -> T {
let mut f = T::zero();
+473 -36
@@ -40,7 +40,7 @@ use crate::linalg::basic::arrays::{Array1, Array2, ArrayView1};
use crate::numbers::basenum::Number;
#[cfg(feature = "serde")]
use serde::{Deserialize, Serialize};
use std::{cmp::Ordering, marker::PhantomData};
use std::marker::PhantomData;
/// Distribution used in the Naive Bayes classifier.
pub(crate) trait NBDistribution<X: Number, Y: Number>: Clone {
@@ -93,42 +93,42 @@ impl<TX: Number, TY: Number, X: Array2<TX>, Y: Array1<TY>, D: NBDistribution<TX,
/// Returns a vector of size N with class estimates.
pub fn predict(&self, x: &X) -> Result<Y, Failed> {
let y_classes = self.distribution.classes();
let predictions = x
.row_iter()
.map(|row| {
y_classes
.iter()
.enumerate()
.map(|(class_index, class)| {
(
class,
self.distribution.log_likelihood(class_index, &row)
+ self.distribution.prior(class_index).ln(),
)
})
// For some reason, the max_by method cannot use NaNs for finding the maximum value, it panics.
// NaN must be considered as minimum values,
// therefore it's like NaNs would not be considered for choosing the maximum value.
// So we need to handle this case for avoiding panicking by using `Option::unwrap`.
.max_by(|(_, p1), (_, p2)| match p1.partial_cmp(p2) {
Some(ordering) => ordering,
None => {
if p1.is_nan() {
Ordering::Less
} else if p2.is_nan() {
Ordering::Greater
if y_classes.is_empty() {
return Err(Failed::predict("Failed to predict, no classes available"));
}
let (rows, _) = x.shape();
let mut predictions = Vec::with_capacity(rows);
let mut all_probs_nan = true;
for row_index in 0..rows {
let row = x.get_row(row_index);
let mut max_log_prob = f64::NEG_INFINITY;
let mut max_class = None;
for (class_index, class) in y_classes.iter().enumerate() {
let log_likelihood = self.distribution.log_likelihood(class_index, &row);
let log_prob = log_likelihood + self.distribution.prior(class_index).ln();
if !log_prob.is_nan() && log_prob > max_log_prob {
max_log_prob = log_prob;
max_class = Some(*class);
all_probs_nan = false;
}
}
predictions.push(max_class.unwrap_or(y_classes[0]));
}
if all_probs_nan {
Err(Failed::predict(
"Failed to predict, all probabilities were NaN",
))
} else {
Ordering::Equal
Ok(Y::from_vec_slice(&predictions))
}
}
})
.map(|(prediction, _probability)| *prediction)
.ok_or_else(|| Failed::predict("Failed to predict, there is no result"))
})
.collect::<Result<Vec<TY>, Failed>>()?;
let y_hat = Y::from_vec_slice(&predictions);
Ok(y_hat)
}
}
pub mod bernoulli;
pub mod categorical;
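The rewritten `predict` above selects, per row, the class with the largest log-probability while skipping NaN scores. The core of that selection can be sketched in isolation as below; `argmax_skipping_nan` is a hypothetical helper written for illustration and is not part of the crate.

```rust
/// Returns the index of the largest non-NaN score, or `None` when the
/// slice is empty or every score is NaN. Ties keep the earliest index.
pub fn argmax_skipping_nan(scores: &[f64]) -> Option<usize> {
    let mut best: Option<(usize, f64)> = None;
    for (index, &score) in scores.iter().enumerate() {
        if score.is_nan() {
            // NaN cannot be ordered, so it never becomes a candidate.
            continue;
        }
        match best {
            // Keep the current best unless the new score is strictly larger.
            Some((_, current)) if score <= current => {}
            _ => best = Some((index, score)),
        }
    }
    best.map(|(index, _)| index)
}
```

One difference from the diff is worth noting: this sketch will select a score of negative infinity if it is the only non-NaN value, whereas the loop in the diff starts from `f64::NEG_INFINITY` with a strict `>` comparison and so falls back to the first class in that situation.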
@@ -147,7 +147,7 @@ mod tests {
#[derive(Debug, PartialEq, Clone)]
struct TestDistribution<'d>(&'d Vec<i32>);
impl<'d> NBDistribution<i32, i32> for TestDistribution<'d> {
impl NBDistribution<i32, i32> for TestDistribution<'_> {
fn prior(&self, _class_index: usize) -> f64 {
1.
}
@@ -177,7 +177,7 @@ mod tests {
Ok(_) => panic!("Should return error in case of empty classes"),
Err(err) => assert_eq!(
err.to_string(),
"Predict failed: Failed to predict, there is no result"
"Predict failed: Failed to predict, no classes available"
),
}
@@ -193,4 +193,441 @@ mod tests {
Err(_) => panic!("Should succeed in normal case without NaNs"),
}
}
// A simple test distribution using float
#[derive(Debug, PartialEq, Clone)]
struct TestDistributionAgain {
classes: Vec<u32>,
probs: Vec<f64>,
}
impl NBDistribution<f64, u32> for TestDistributionAgain {
fn classes(&self) -> &Vec<u32> {
&self.classes
}
fn prior(&self, class_index: usize) -> f64 {
self.probs[class_index]
}
fn log_likelihood<'a>(
&'a self,
class_index: usize,
_j: &'a Box<dyn ArrayView1<f64> + 'a>,
) -> f64 {
self.probs[class_index].ln()
}
}
type TestNB = BaseNaiveBayes<f64, u32, DenseMatrix<f64>, Vec<u32>, TestDistributionAgain>;
#[test]
fn test_predict_empty_classes() {
let dist = TestDistributionAgain {
classes: vec![],
probs: vec![],
};
let nb = TestNB::fit(dist).unwrap();
let x = DenseMatrix::from_2d_array(&[&[1.0, 2.0], &[3.0, 4.0]]).unwrap();
assert!(nb.predict(&x).is_err());
}
#[test]
fn test_predict_single_class() {
let dist = TestDistributionAgain {
classes: vec![1],
probs: vec![1.0],
};
let nb = TestNB::fit(dist).unwrap();
let x = DenseMatrix::from_2d_array(&[&[1.0, 2.0], &[3.0, 4.0]]).unwrap();
let result = nb.predict(&x).unwrap();
assert_eq!(result, vec![1, 1]);
}
#[test]
fn test_predict_multiple_classes() {
let dist = TestDistributionAgain {
classes: vec![1, 2, 3],
probs: vec![0.2, 0.5, 0.3],
};
let nb = TestNB::fit(dist).unwrap();
let x = DenseMatrix::from_2d_array(&[&[1.0, 2.0], &[3.0, 4.0], &[5.0, 6.0]]).unwrap();
let result = nb.predict(&x).unwrap();
assert_eq!(result, vec![2, 2, 2]);
}
#[test]
fn test_predict_with_nans() {
let dist = TestDistributionAgain {
classes: vec![1, 2],
probs: vec![f64::NAN, 0.5],
};
let nb = TestNB::fit(dist).unwrap();
let x = DenseMatrix::from_2d_array(&[&[1.0, 2.0], &[3.0, 4.0]]).unwrap();
let result = nb.predict(&x).unwrap();
assert_eq!(result, vec![2, 2]);
}
#[test]
fn test_predict_all_nans() {
let dist = TestDistributionAgain {
classes: vec![1, 2],
probs: vec![f64::NAN, f64::NAN],
};
let nb = TestNB::fit(dist).unwrap();
let x = DenseMatrix::from_2d_array(&[&[1.0, 2.0], &[3.0, 4.0]]).unwrap();
assert!(nb.predict(&x).is_err());
}
#[test]
fn test_predict_extreme_probabilities() {
let dist = TestDistributionAgain {
classes: vec![1, 2],
probs: vec![1e-300, 1e-301],
};
let nb = TestNB::fit(dist).unwrap();
let x = DenseMatrix::from_2d_array(&[&[1.0, 2.0], &[3.0, 4.0]]).unwrap();
let result = nb.predict(&x).unwrap();
assert_eq!(result, vec![1, 1]);
}
#[test]
fn test_predict_with_infinity() {
let dist = TestDistributionAgain {
classes: vec![1, 2, 3],
probs: vec![f64::INFINITY, 1.0, 2.0],
};
let nb = TestNB::fit(dist).unwrap();
let x = DenseMatrix::from_2d_array(&[&[1.0, 2.0], &[3.0, 4.0]]).unwrap();
let result = nb.predict(&x).unwrap();
assert_eq!(result, vec![1, 1]);
}
#[test]
fn test_predict_with_negative_infinity() {
let dist = TestDistributionAgain {
classes: vec![1, 2, 3],
probs: vec![f64::NEG_INFINITY, 1.0, 2.0],
};
let nb = TestNB::fit(dist).unwrap();
let x = DenseMatrix::from_2d_array(&[&[1.0, 2.0], &[3.0, 4.0]]).unwrap();
let result = nb.predict(&x).unwrap();
assert_eq!(result, vec![3, 3]);
}
#[test]
fn test_gaussian_naive_bayes_numerical_stability() {
#[derive(Debug, PartialEq, Clone)]
struct GaussianTestDistribution {
classes: Vec<u32>,
means: Vec<Vec<f64>>,
variances: Vec<Vec<f64>>,
priors: Vec<f64>,
}
impl NBDistribution<f64, u32> for GaussianTestDistribution {
fn classes(&self) -> &Vec<u32> {
&self.classes
}
fn prior(&self, class_index: usize) -> f64 {
self.priors[class_index]
}
fn log_likelihood<'a>(
&'a self,
class_index: usize,
j: &'a Box<dyn ArrayView1<f64> + 'a>,
) -> f64 {
let means = &self.means[class_index];
let variances = &self.variances[class_index];
j.iterator(0)
.enumerate()
.map(|(i, &xi)| {
let mean = means[i];
let var = variances[i] + 1e-9; // Small smoothing for numerical stability
let coeff = -0.5 * (2.0 * std::f64::consts::PI * var).ln();
let exponent = -(xi - mean).powi(2) / (2.0 * var);
coeff + exponent
})
.sum()
}
}
fn train_distribution(x: &DenseMatrix<f64>, y: &[u32]) -> GaussianTestDistribution {
let mut classes: Vec<u32> = y
.iter()
.cloned()
.collect::<std::collections::HashSet<u32>>()
.into_iter()
.collect();
classes.sort();
let n_classes = classes.len();
let n_features = x.shape().1;
let mut means = vec![vec![0.0; n_features]; n_classes];
let mut variances = vec![vec![0.0; n_features]; n_classes];
let mut class_counts = vec![0; n_classes];
// Calculate means and count samples per class
for (sample, &class) in x.row_iter().zip(y.iter()) {
let class_idx = classes.iter().position(|&c| c == class).unwrap();
class_counts[class_idx] += 1;
for (i, &value) in sample.iterator(0).enumerate() {
means[class_idx][i] += value;
}
}
// Normalize means
for (class_idx, mean) in means.iter_mut().enumerate() {
for value in mean.iter_mut() {
*value /= class_counts[class_idx] as f64;
}
}
// Calculate variances
for (sample, &class) in x.row_iter().zip(y.iter()) {
let class_idx = classes.iter().position(|&c| c == class).unwrap();
for (i, &value) in sample.iterator(0).enumerate() {
let diff = value - means[class_idx][i];
variances[class_idx][i] += diff * diff;
}
}
// Normalize variances and add small epsilon to avoid zero variance
let epsilon = 1e-9;
for (class_idx, variance) in variances.iter_mut().enumerate() {
for value in variance.iter_mut() {
*value = *value / class_counts[class_idx] as f64 + epsilon;
}
}
// Calculate priors
let total_samples = y.len() as f64;
let priors: Vec<f64> = class_counts
.iter()
.map(|&count| count as f64 / total_samples)
.collect();
GaussianTestDistribution {
classes,
means,
variances,
priors,
}
}
type TestNBGaussian =
BaseNaiveBayes<f64, u32, DenseMatrix<f64>, Vec<u32>, GaussianTestDistribution>;
// Create a constant training dataset
let n_samples = 1000;
let n_features = 5;
let n_classes = 4;
let mut x_data = Vec::with_capacity(n_samples * n_features);
let mut y_data = Vec::with_capacity(n_samples);
for i in 0..n_samples {
for j in 0..n_features {
x_data.push((i * j) as f64 % 10.0);
}
y_data.push((i % n_classes) as u32);
}
let x = DenseMatrix::new(n_samples, n_features, x_data, true).unwrap();
let y = y_data;
// Train the model
let dist = train_distribution(&x, &y);
let nb = TestNBGaussian::fit(dist).unwrap();
// Create constant test data
let n_test_samples = 100;
let mut test_x_data = Vec::with_capacity(n_test_samples * n_features);
for i in 0..n_test_samples {
for j in 0..n_features {
test_x_data.push((i * j * 2) as f64 % 15.0);
}
}
let test_x = DenseMatrix::new(n_test_samples, n_features, test_x_data, true).unwrap();
// Make predictions
let predictions = nb
.predict(&test_x)
.map_err(|e| format!("Prediction failed: {}", e))
.unwrap();
// Check numerical stability
assert_eq!(
predictions.len(),
n_test_samples,
"Number of predictions should match number of test samples"
);
// Check that all predictions are valid class labels
for &pred in predictions.iter() {
assert!(pred < n_classes as u32, "Predicted class should be valid");
}
// Check consistency of predictions
let repeated_predictions = nb
.predict(&test_x)
.map_err(|e| format!("Repeated prediction failed: {}", e))
.unwrap();
assert_eq!(
predictions, repeated_predictions,
"Predictions should be consistent when repeated"
);
// Check extreme values
let extreme_x =
DenseMatrix::new(2, n_features, vec![f64::MAX; n_features * 2], true).unwrap();
let extreme_predictions = nb.predict(&extreme_x);
assert!(
extreme_predictions.is_err(),
"Extreme value input should result in an error"
);
assert_eq!(
extreme_predictions.unwrap_err().to_string(),
"Predict failed: Failed to predict, all probabilities were NaN",
"Incorrect error message for extreme values"
);
// Check for NaN handling
let nan_x = DenseMatrix::new(2, n_features, vec![f64::NAN; n_features * 2], true).unwrap();
let nan_predictions = nb.predict(&nan_x);
assert!(
nan_predictions.is_err(),
"NaN input should result in an error"
);
// Check for very small values
let small_x =
DenseMatrix::new(2, n_features, vec![f64::MIN_POSITIVE; n_features * 2], true).unwrap();
let small_predictions = nb
.predict(&small_x)
.map_err(|e| format!("Small value prediction failed: {}", e))
.unwrap();
for &pred in small_predictions.iter() {
assert!(
pred < n_classes as u32,
"Predictions for very small values should be valid"
);
}
// Check for values close to zero
let near_zero_x =
DenseMatrix::new(2, n_features, vec![1e-300; n_features * 2], true).unwrap();
let near_zero_predictions = nb
.predict(&near_zero_x)
.map_err(|e| format!("Near-zero value prediction failed: {}", e))
.unwrap();
for &pred in near_zero_predictions.iter() {
assert!(
pred < n_classes as u32,
"Predictions for near-zero values should be valid"
);
}
println!("All numerical stability checks passed!");
}
#[test]
fn test_gaussian_naive_bayes_numerical_stability_random_data() {
#[derive(Debug)]
struct MySimpleRng {
state: u64,
}
impl MySimpleRng {
fn new(seed: u64) -> Self {
MySimpleRng { state: seed }
}
/// Get the next u64 in the sequence.
fn next_u64(&mut self) -> u64 {
// LCG parameters; these are somewhat arbitrary but commonly used.
// Feel free to tweak the multiplier/adder etc.
self.state = self.state.wrapping_mul(6364136223846793005).wrapping_add(1);
self.state
}
/// Get an f64 in the range [min, max]. The upper bound is reachable because
/// large u64 values round up when converted to f64.
fn next_f64(&mut self, min: f64, max: f64) -> f64 {
let fraction = (self.next_u64() as f64) / (u64::MAX as f64);
min + fraction * (max - min)
}
/// Get a usize in the range [min, max). This floors the floating result.
fn gen_range_usize(&mut self, min: usize, max: usize) -> usize {
let v = self.next_f64(min as f64, max as f64);
// Truncate into the integer range. Because of floating inexactness,
// ensure we also clamp.
let int_v = v.floor() as isize;
// simple clamp to avoid any float rounding out of range
let clamped = int_v.max(min as isize).min((max - 1) as isize);
clamped as usize
}
}
use crate::naive_bayes::gaussian::GaussianNB;
// We will generate random data in a reproducible way (using a fixed seed).
let mut rng = MySimpleRng::new(42);
let n_samples = 1000;
let n_features = 5;
let n_classes = 4;
// Our feature matrix and label vector
let mut x_data = Vec::with_capacity(n_samples * n_features);
let mut y_data = Vec::with_capacity(n_samples);
// Fill x_data with random values and y_data with random class labels.
for _i in 0..n_samples {
for _j in 0..n_features {
// We'll pick random values in [-10, 10).
x_data.push(rng.next_f64(-10.0, 10.0));
}
let class = rng.gen_range_usize(0, n_classes) as u32;
y_data.push(class);
}
// Create DenseMatrix from x_data
let x = DenseMatrix::new(n_samples, n_features, x_data, true).unwrap();
// Train GaussianNB
let gnb = GaussianNB::fit(&x, &y_data, Default::default())
.expect("Fitting GaussianNB with random data failed.");
// Predict on the same training data to verify no numerical instability
let predictions = gnb.predict(&x).expect("Prediction on random data failed.");
// Basic sanity checks
assert_eq!(
predictions.len(),
n_samples,
"Prediction size must match n_samples"
);
for &pred_class in &predictions {
assert!(
(pred_class as usize) < n_classes,
"Predicted class {} is out of range [0..n_classes).",
pred_class
);
}
// If you want to compare with scikit-learn, you can do something like:
// println!("X = {:?}", &x);
// println!("Y = {:?}", &y_data);
// println!("predictions = {:?}", &predictions);
// and then in Python:
// import numpy as np
// from sklearn.naive_bayes import GaussianNB
// X = np.reshape(np.array(x), (1000, 5), order='F')
// Y = np.array(y)
// gnb = GaussianNB().fit(X, Y)
// preds = gnb.predict(X)
// expected = np.array(predictions)
// assert np.array_equal(expected, preds)
// They should match closely (or exactly) depending on floating rounding.
}
}
+3 -7
@@ -172,18 +172,14 @@ where
T: Number + RealNumber,
M: Array2<T>,
{
if let Some(output_matrix) = columns.first().cloned() {
return Some(
columns.first().cloned().map(|output_matrix| {
columns
.iter()
.skip(1)
.fold(output_matrix, |current_matrix, new_colum| {
current_matrix.h_stack(new_colum)
}),
);
} else {
None
}
})
})
}
#[cfg(test)]
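The hunk above collapses an `if let Some(..) { return Some(..) } else { None }` into `Option::map` over the first element, followed by a fold over the rest. A minimal sketch of that pattern on plain vectors follows; `concat_all` is a hypothetical function, and `extend_from_slice` stands in for the crate's `h_stack`.

```rust
/// Concatenates all blocks in order; returns `None` for an empty list.
pub fn concat_all(blocks: &[Vec<i32>]) -> Option<Vec<i32>> {
    // `first()` yields `None` on an empty slice, so `map` handles that case
    // without an explicit `if let` / `else` branch.
    blocks.first().cloned().map(|first| {
        blocks.iter().skip(1).fold(first, |mut acc, next| {
            acc.extend_from_slice(next);
            acc
        })
    })
}
```

The first element seeds the accumulator, which is why the fold skips it.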
+1 -1
@@ -30,7 +30,7 @@ pub struct CSVDefinition<'a> {
/// What separates the fields in your csv-file?
field_seperator: &'a str,
}
impl<'a> Default for CSVDefinition<'a> {
impl Default for CSVDefinition<'_> {
fn default() -> Self {
Self {
n_rows_header: 1,
+3 -3
@@ -360,8 +360,8 @@ impl<'a, TX: Number + RealNumber, TY: Number + Ord, X: Array2<TX> + 'a, Y: Array
}
}
impl<'a, TX: Number + RealNumber, TY: Number + Ord, X: Array2<TX>, Y: Array1<TY>> PartialEq
for SVC<'a, TX, TY, X, Y>
impl<TX: Number + RealNumber, TY: Number + Ord, X: Array2<TX>, Y: Array1<TY>> PartialEq
for SVC<'_, TX, TY, X, Y>
{
fn eq(&self, other: &Self) -> bool {
if (self.b.unwrap().sub(other.b.unwrap())).abs() > TX::epsilon() * TX::two()
@@ -1110,7 +1110,7 @@ mod tests {
let svc = SVC::fit(&x, &y, &params).unwrap();
// serialization
let deserialized_svc: SVC<f64, i32, _, _> =
let deserialized_svc: SVC<'_, f64, i32, _, _> =
serde_json::from_str(&serde_json::to_string(&svc).unwrap()).unwrap();
assert_eq!(svc, deserialized_svc);
+3 -3
@@ -281,8 +281,8 @@ impl<'a, T: Number + FloatNumber + PartialOrd, X: Array2<T>, Y: Array1<T>> SVR<'
}
}
impl<'a, T: Number + FloatNumber + PartialOrd, X: Array2<T>, Y: Array1<T>> PartialEq
for SVR<'a, T, X, Y>
impl<T: Number + FloatNumber + PartialOrd, X: Array2<T>, Y: Array1<T>> PartialEq
for SVR<'_, T, X, Y>
{
fn eq(&self, other: &Self) -> bool {
if (self.b - other.b).abs() > T::epsilon() * T::two()
@@ -702,7 +702,7 @@ mod tests {
let svr = SVR::fit(&x, &y, &params).unwrap();
let deserialized_svr: SVR<f64, DenseMatrix<f64>, _> =
let deserialized_svr: SVR<'_, f64, DenseMatrix<f64>, _> =
serde_json::from_str(&serde_json::to_string(&svr).unwrap()).unwrap();
assert_eq!(svr, deserialized_svr);
+113
@@ -77,7 +77,9 @@ use serde::{Deserialize, Serialize};
use crate::api::{Predictor, SupervisedEstimator};
use crate::error::Failed;
use crate::linalg::basic::arrays::MutArray;
use crate::linalg::basic::arrays::{Array1, Array2, MutArrayView1};
use crate::linalg::basic::matrix::DenseMatrix;
use crate::numbers::basenum::Number;
use crate::rand_custom::get_rng_impl;
@@ -887,11 +889,77 @@ impl<TX: Number + PartialOrd, TY: Number + Ord, X: Array2<TX>, Y: Array1<TY>>
}
importances
}
/// Predict class probabilities for the input samples.
///
/// # Arguments
///
/// * `x` - The input samples as a matrix where each row is a sample and each column is a feature.
///
/// # Returns
///
/// A `Result` containing a `DenseMatrix<f64>` where each row corresponds to a sample and each column
/// corresponds to a class. The values represent the probability of the sample belonging to each class.
///
/// # Errors
///
/// Returns an error if prediction fails for any row.
pub fn predict_proba(&self, x: &X) -> Result<DenseMatrix<f64>, Failed> {
let (n_samples, _) = x.shape();
let n_classes = self.classes().len();
let mut result = DenseMatrix::<f64>::zeros(n_samples, n_classes);
for i in 0..n_samples {
let probs = self.predict_proba_for_row(x, i)?;
for (j, &prob) in probs.iter().enumerate() {
result.set((i, j), prob);
}
}
Ok(result)
}
/// Predict class probabilities for a single input sample.
///
/// # Arguments
///
/// * `x` - The input matrix containing all samples.
/// * `row` - The index of the row in `x` for which to predict probabilities.
///
/// # Returns
///
/// A vector of probabilities, one for each class, representing the probability
/// of the input sample belonging to each class.
fn predict_proba_for_row(&self, x: &X, row: usize) -> Result<Vec<f64>, Failed> {
let mut node = 0;
while let Some(current_node) = self.nodes().get(node) {
if current_node.true_child.is_none() && current_node.false_child.is_none() {
// Leaf node reached
let mut probs = vec![0.0; self.classes().len()];
probs[current_node.output] = 1.0;
return Ok(probs);
}
let split_feature = current_node.split_feature;
let split_value = current_node.split_value.unwrap_or(f64::NAN);
if x.get((row, split_feature)).to_f64().unwrap() <= split_value {
node = current_node.true_child.unwrap();
} else {
node = current_node.false_child.unwrap();
}
}
// This should never happen if the tree is properly constructed
Err(Failed::predict("Nodes iteration did not reach leaf"))
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::linalg::basic::arrays::Array;
use crate::linalg::basic::matrix::DenseMatrix;
#[test]
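The traversal in `predict_proba_for_row` walks from the root to a leaf and returns a one-hot vector for the leaf's class. The sketch below shows the same idea on a flat slice of nodes. The `Node` struct here is a simplified, hypothetical layout (not the crate's actual node type), and `leaf_probabilities` is an illustrative name.

```rust
/// Hypothetical, simplified tree node used only for this sketch.
pub struct Node {
    /// Class index predicted at this node.
    pub output: usize,
    /// Column of the sample compared against `split_value`.
    pub split_feature: usize,
    pub split_value: f64,
    /// Child taken when `sample[split_feature] <= split_value`.
    pub true_child: Option<usize>,
    /// Child taken otherwise.
    pub false_child: Option<usize>,
}

/// Walks the tree from node 0 and returns a one-hot probability vector
/// for the leaf that the sample lands in. Returns `None` if an index is
/// out of range (malformed tree or class index).
pub fn leaf_probabilities(nodes: &[Node], sample: &[f64], n_classes: usize) -> Option<Vec<f64>> {
    let mut index = 0;
    loop {
        let node = nodes.get(index)?;
        match (node.true_child, node.false_child) {
            // Internal node: follow the branch chosen by the split test.
            (Some(t), Some(f)) => {
                let value = *sample.get(node.split_feature)?;
                index = if value <= node.split_value { t } else { f };
            }
            // Any node without both children is treated as a leaf here.
            _ => {
                let mut probs = vec![0.0; n_classes];
                *probs.get_mut(node.output)? = 1.0;
                return Some(probs);
            }
        }
    }
}
```

Because each leaf contributes a single 1.0, every returned row sums to one, which is the property the test in the next hunk checks. Unlike the diff, this sketch avoids `unwrap` by treating a node with only one child as a leaf; that is a simplification, not the crate's behaviour.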
@@ -934,6 +1002,51 @@ mod tests {
);
}
#[cfg_attr(
all(target_arch = "wasm32", not(target_os = "wasi")),
wasm_bindgen_test::wasm_bindgen_test
)]
#[test]
fn test_predict_proba() {
let x: DenseMatrix<f64> = DenseMatrix::from_2d_array(&[
&[5.1, 3.5, 1.4, 0.2],
&[4.9, 3.0, 1.4, 0.2],
&[4.7, 3.2, 1.3, 0.2],
&[4.6, 3.1, 1.5, 0.2],
&[5.0, 3.6, 1.4, 0.2],
&[7.0, 3.2, 4.7, 1.4],
&[6.4, 3.2, 4.5, 1.5],
&[6.9, 3.1, 4.9, 1.5],
&[5.5, 2.3, 4.0, 1.3],
&[6.5, 2.8, 4.6, 1.5],
])
.unwrap();
let y: Vec<usize> = vec![0, 0, 0, 0, 0, 1, 1, 1, 1, 1];
let tree = DecisionTreeClassifier::fit(&x, &y, Default::default()).unwrap();
let probabilities = tree.predict_proba(&x).unwrap();
assert_eq!(probabilities.shape(), (10, 2));
for row in 0..10 {
let row_sum: f64 = probabilities.get_row(row).sum();
assert!(
(row_sum - 1.0).abs() < 1e-6,
"Row probabilities should sum to 1"
);
}
// Check if the first 5 samples have higher probability for class 0
for i in 0..5 {
assert!(probabilities.get((i, 0)) > probabilities.get((i, 1)));
}
// Check if the last 5 samples have higher probability for class 1
for i in 5..10 {
assert!(probabilities.get((i, 1)) > probabilities.get((i, 0)));
}
}
#[cfg_attr(
all(target_arch = "wasm32", not(target_os = "wasi")),
wasm_bindgen_test::wasm_bindgen_test