add rho TCopula and Gaussian CDF #330
Conversation
Codecov Report ❌ Patch coverage is …

```
@@ Coverage Diff @@
## main #330 +/- ##
===========================================
- Coverage 77.95% 17.65% -60.31%
===========================================
Files 84 84
Lines 4986 5120 +134
===========================================
- Hits 3887 904 -2983
- Misses 1099 4216 +3117
```
|
|
This is very interesting, but I think we need more tests and more comparisons to MvNormalCDF and/or other sources of truth (for example, you could compare to R::mvtnorm?)
|
|
Okay, all of this needs to be studied. I'll expand the validation set as follows: perhaps I could add benchmarks using MvNormalCDF's dimension-based scaling (m = 1000*d)? Since I currently use the MvNormalCDF package as a reference, I'll compare my deltas to its reported error estimates. Also, I think we could add a basic consistency test for the Student's t-copula CDF. However, I think that comparing directly with R::mvtnorm would put us at a disadvantage, because R uses Alan Genz's Fortran implementation, vectorized and compiled, right? |
|
|
Well, I did the following:

```julia
using Test, Copulas, MvNormalCDF, StatsFuns, BenchmarkTools, LinearAlgebra

@testset "GaussianCopula – Scaling vs MvNormalCDF" begin
    ρ = 0.5
    dims = [2, 3, 5, 10, 15, 20]
    for d in dims
        Σ = fill(ρ, d, d)
        Σ[diagind(Σ)] .= 1.0          # equicorrelation matrix with unit diagonal
        C = GaussianCopula(Σ)
        u = fill(0.8, d)
        x = StatsFuns.norminvcdf.(u)  # map uniforms to normal quantiles
        m = min(1_000 * d, 100_000)   # MvNormalCDF-style dimension scaling
        p_ref, err_ref = MvNormalCDF.mvnormcdf(Σ, fill(-Inf, d), x; m=m)
        p_fast = Copulas._cdf(C, u)
        Δ = abs(p_ref - p_fast)
        @info "d=$d Δ=$Δ err_ref=$err_ref"
        @test Δ ≤ 5 * err_ref
    end
end
```

and I got these results (validation vs MvNormalCDF):
|
| d | Δ | err_ref | Δ ≤ 5×err_ref |
|---|---|---|---|
| 2 | 1.29e−4 | 2.43e−4 | ✅ |
| 3 | 2.14e−4 | 3.32e−4 | ✅ |
| 5 | 6.69e−5 | 3.71e−4 | ✅ |
| 10 | 1.28e−4 | 2.87e−4 | ✅ |
| 15 | 2.06e−4 | 5.94e−4 | ✅ |
| 20 | 7.77e−5 | 4.95e−4 | ✅ |
All tests pass:

```
Test Summary:                           | Pass  Total  Time
GaussianCopula – Scaling vs MvNormalCDF |    6      6  3.8s
```

I added the tests so you can view them, and the benchmark shows the following:
```julia
julia> val_est = Copulas._cdf_base(Σ, b; abseps=1e-4, releps=1e-4, maxpts=35_000, m0=1028, r=2)
0.6653147302829496

julia> @benchmark Copulas._cdf_base($Σ, $b)
BenchmarkTools.Trial: 2308 samples with 1 evaluation per sample.
 Range (min … max):  1.899 ms … 23.032 ms  ┊ GC (min … max): 0.00% … 90.53%
 Time  (median):     2.113 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.160 ms ± 521.602 μs ┊ GC (mean ± σ):  1.94% ± 6.39%
 Memory estimate: 553.34 KiB, allocs estimate: 4209.

julia> @benchmark Copulas._cdf_base($Σ, $b)
BenchmarkTools.Trial: 9824 samples with 1 evaluation per sample.
 Range (min … max):  421.708 μs … 26.419 ms  ┊ GC (min … max): 0.00% … 97.90%
 Time  (median):     483.375 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   507.761 μs ± 478.827 μs ┊ GC (mean ± σ):  4.39% ± 5.10%
 Memory estimate: 196.05 KiB, allocs estimate: 4160.
```

These are the results; for comparison you can see the end of the MvNormalCDF.jl test file here.
|
Can you re-run the benchmarks with the two implementations, at the same level of precision (e.g. about the same number of samples) on the same machine, so that we get comparable timings? Maybe their machine was very slow compared to yours. W.r.t. quality it seems like your new version is OK compared to the previous one, so perfect. If possible, it would be nice if the default number of samples matched what MvNormalCDF did, that is, something growing linearly with the dimension. But since we are that much faster, maybe we can add a bit more at the beginning, like … Other question:
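The linear-in-dimension default discussed here could be captured by a tiny helper. The sketch below is purely illustrative (the name `default_sample_count` and the floor value are assumptions, not the PR's code); it mirrors MvNormalCDF's m = 1000*d rule while guaranteeing a minimum budget in low dimensions:

```julia
# Hypothetical helper (not in the PR): pick the QMC sample count from the
# dimension d, mirroring MvNormalCDF's m = 1000*d linear rule, with a floor
# so that small dimensions still receive a reasonable budget.
default_sample_count(d::Integer; base::Integer=1_000, floor_m::Integer=10_000) =
    max(base * d, floor_m)
```

With the defaults above, dimensions up to 10 all get 10,000 points, and larger dimensions scale linearly.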
|
|
I've been working a bit under exactly the same conditions, and in general on my laptop (the fast one) it's still much faster. But if we use the same number of samples and everything exactly as MvNormalCDF does, then we lose... maybe because of the Sobol sequence I use. I saw that it uses a sequence that I don't know very well, and I think that one is faster than Sobol. |
|
Yes, but if the new method is more precise with fewer samples, then we win? Another way out: convert the current method from the Gaussian CDF to the Student CDF, which is really missing from the current implementation; then you always win, since there is no competition yet ;) |
|
Okay, I spent a bit of time reviewing the other package... I didn't quite understand what happened XD Back to the topic: I've got it working now, although I couldn't fully match the reference Fortran implementation. As expected, using 12 randomized quasi-Monte Carlo replications provides a proper error estimate. However, when we skip the error estimation (single pass), the method becomes faster and more memory-efficient; in that configuration, it clearly outperforms existing Julia implementations. The same approach extends naturally to the multivariate Student-t case, which I've already implemented. It works well, although the stochastic error is somewhat larger, likely due to the Sobol sequence and the adaptive Cholesky. |
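The replication scheme mentioned above (r randomized copies of one QMC point set, with the spread of the replicate means giving the error estimate) can be sketched as follows. This is an illustrative stand-in, not the PR's code: a rank-1 lattice replaces the Sobol sequence, and the function name and generating vector are assumptions.

```julia
using Random, Statistics

# Sketch of randomized QMC with Cranley–Patterson shifts: r shifted copies of
# one low-discrepancy point set give r independent estimates; their mean is
# the final value and their spread yields a standard-error estimate.
# A rank-1 lattice stands in for the Sobol sequence (illustrative choice).
function rqmc_estimate(f, d::Int; m::Int=4_096, r::Int=12, rng=Random.default_rng())
    z = [powermod(fld(m, 7), j, m) for j in 0:d-1]   # crude generating vector
    ests = map(1:r) do _
        shift = rand(rng, d)                         # Cranley–Patterson shift
        s = 0.0
        for k in 0:m-1
            u = mod.(k .* z ./ m .+ shift, 1.0)      # shifted lattice point
            s += f(u)
        end
        s / m
    end
    mean(ests), 3 * std(ests) / sqrt(r)              # value and ~3σ error bar
end

# Smooth test integrand: ∫ u₁u₂u₃ du over the unit cube equals 1/8.
p, err = rqmc_estimate(u -> prod(u), 3; rng=MersenneTwister(1))
```

Skipping the replication loop (r = 1) is exactly the faster single-pass mode described above: you keep the point estimate but lose the error bar.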
|
Yes, I am sorry for what happened with the other package; maybe we could rediscuss it together someday. Waiting for your code here to be ready; tell me when you want me to take a look |
|
Yes, I'll let you know when I have it ready so you can check it out. It won't take long, but I forgot my glasses and I can't work well like that 😂 |
|
Hi @lrnv — I tried implementing these functions and adapted parts of the logic from …. For the t distribution, I experimented with several approaches, but none can really match pure Fortran performance. So I decided to reuse the same QMC structure from the Gaussian case and just add a small radial χ²-based block. It works quite well and has been empirically validated against …. I think it could still be made a bit more "Julian," but both versions perform consistently and respond well across dimensions. Below is a short summary of what's implemented. The only missing step is to hook it into the main API so that it dispatches directly from …. Implemented functions:
Example 1: Normal

```julia
using Random, LinearAlgebra, Distributions, MvNormalCDF, StableRNGs, Copulas

rng = StableRNG(123)
d = 8
A = randn(rng, d, d)
Σ = Matrix(Symmetric(A*A' + 0.1I))
a = fill(-Inf, d)
b = randn(rng, d)
m, r = 25_000, 12
rng_ref, rng_qmc = StableRNG(202), StableRNG(202)
p_ref, e_ref = MvNormalCDF.qsimvnv!(copy(Σ), copy(a), copy(b), m, rng_ref)
p_qmc, e_qmc = Copulas.qmc_orthant_normal!(copy(Σ), copy(b); m=m, r=r, rng=rng_qmc)
println(("p_ref = $p_ref ± $e_ref", "p_qmc = $p_qmc ± $e_qmc", "|Δ| = $(abs(p_ref - p_qmc))"))
```

```
("p_ref = 0.024469473604786227 ± 4.8609855372307254e-5",
 "p_qmc = 0.02446947360478623 ± 4.860985537231269e-5",
 "|Δ| = 3.469446951953614e-18")
```

Example 2: multivariate t

```julia
rng = StableRNG(123)
d, ν = 10, 5
A = randn(rng, d, d)
Σ = Matrix(Symmetric(A*A' + 0.1I))
b = randn(rng, d)
p_t, e_t = Copulas.qmc_orthant_t!(copy(Σ), copy(b), ν; m=25_000, r=12, rng=StableRNG(202))
println("Student-t (ν=$ν): p = $p_t ± $e_t")
```

```
Student-t (ν=5): p = 0.0003943872 ± 3.60e-6
```

Reference (R → …). Results match exactly within the estimated error range, while our implementation uses less memory and fewer allocations. Furthermore, this idea extends naturally, for example to skew-t copulas, etc. (although it is true that they are not used much 😂) |
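The radial χ² block mentioned above rests on the classical scale-mixture representation of the multivariate t: X = Z·√(ν/W) with Z ~ N(0, Σ) and W ~ χ²_ν, so P(X ≤ b) = E_W[Φ_Σ(b·√(W/ν))] and the Gaussian orthant machinery can be reused under the radial mixing. The sketch below only validates this representation by plain simulation; the sampler and its name are illustrative, not the PR's `qmc_orthant_t!`.

```julia
using Random, LinearAlgebra, Statistics

# Illustrative sampler (assumption, not the PR's code): draw multivariate t_ν
# vectors via the scale mixture X = Z * sqrt(ν/W), Z ~ N(0, Σ), W ~ χ²_ν.
function sample_mvt(rng, Σ::AbstractMatrix, ν::Int, n::Int)
    d = size(Σ, 1)
    L = cholesky(Symmetric(Σ)).L
    X = Matrix{Float64}(undef, d, n)
    for j in 1:n
        z = L * randn(rng, d)
        w = sum(abs2, randn(rng, ν))   # χ²_ν as a sum of ν squared normals
        X[:, j] = z .* sqrt(ν / w)
    end
    X
end

rng = MersenneTwister(42)
X = sample_mvt(rng, Matrix(1.0I, 2, 2), 5, 200_000)
phat = mean(all(X .<= 0.0; dims=1))    # orthant probability estimate
```

For Σ = I the t distribution is spherically symmetric, so the negative-orthant probability in 2 dimensions is exactly 1/4; the simulated `phat` recovers this to Monte Carlo accuracy, which is a cheap sanity check of the mixing construction.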
|
@Santymax98 I'm confused: the current state of this pull request does not change the _cdf() of the Gaussian and T copulas, and all the added methods are not even used in the package's API? What am I supposed to do with it? |
|
Sorry — I just realized I forgot to push the …. I've now added the dispatch for …. For example, using the covariance matrix of size 25 from the tests in …:

```julia
using Random, BenchmarkTools, Copulas

Random.seed!(123)
td = td[14, 1]  # assuming you copied the entire td from the MvNormalCDF.jl tests
G = GaussianCopula(td)
u = rand(size(td, 1))
@benchmark cdf(G, u; method=:qmc)
```
```
BenchmarkTools.Trial: 244 samples with 1 evaluation per sample.
 Range (min … max):  20.070 ms … 37.483 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     20.190 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.464 ms ±  1.707 ms  ┊ GC (mean ± σ):  0.22% ± 1.93%
 Memory estimate: 488.92 KiB, allocs estimate: 29.
```

```julia
@benchmark cdf(G, u; method=:normal)
```

```
BenchmarkTools.Trial: 197 samples with 1 evaluation per sample.
 Range (min … max):  25.122 ms … 33.025 ms  ┊ GC (min … max): 0.00% … 23.46%
 Time  (median):     25.385 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.436 ms ± 558.561 μs ┊ GC (mean ± σ):  0.15% ± 1.67%
 Memory estimate: 596.03 KiB, allocs estimate: 485.
```

```julia
T = TCopula(5, td)
@benchmark cdf(T, u)
```

```
BenchmarkTools.Trial: 134 samples with 1 evaluation per sample.
 Range (min … max):  36.161 ms … 48.235 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     37.259 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   37.451 ms ±  1.196 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 509.06 KiB, allocs estimate: 167.
```

The QMC version matches ….

PS: For the Gaussian case I only left the two implementations for comparison... |
|
Okay, thanks ^^ So now I guess: …
|
|
Yes, I'm working on it right now. It was actually a bit difficult at first to design something that works well, but I think we got it. I don't think we have to change the tests, because it should work the same, hahaha, or so I hope. In the Gaussian case I left both methods so we can compare directly, but if everything goes well I think it's your decision whether to keep both methods or not... although if it goes well we could remove the MvNormalCDF dependency |
|
Yes, I think that the right approach is to remove the dependency on MvNormalCDF, except in the tests where it is needed for comparison, of course |
|
Okay, I'll keep working on this... and on the tests... I won't be long 👍 |
|
Ok, the Gaussian tests pass. One problem remains with the approximation of the marginal values of the t CDF in terms of tolerance... I just don't know if I should modify the entire tolerance or adjust something else. By the way, I think there are simpler forms for two and three dimensions in the codes I've reviewed. Perhaps it would be appropriate to implement them? Or is it better to keep this one for all cases? |
|
By simpler forms, you mean faster? It would be nice to keep the generality and have something that works for all dimensions, but if there is a much faster version for smaller dimensions, then it might be useful to have it too.
Why did you remove the … |
|
Okay, I was working on it. I removed the td cases because they're generic, and I was debugging only the ones that apply to our copula case, for example selecting those whose lower bound is -infinity; otherwise, it doesn't make sense to talk about copulas. But if you'd like, I can modify them to compare the direct results of MvNormalCDF.jl with the new implementation of qmc_orthant_normal!, specializing to the case where the lower bound is -infinity, because otherwise it wouldn't work. |
|
Do as you think is best; I haven't looked at the details yet, I was just surprised that they disappeared ^^
…iate version exists?
Yes, of course; for the TCopula's CDF that is probably the right way to go |
|
For tests that are not from -Inf to b, you should compare to the … |
|
From the tests, I see that the default precision is not very good: 1e-3 for the TCopula uniformity checks, and 1e-3 for the measure() checks on the Gaussian copula? Maybe we should increase the default number of samples to reduce these tolerances? |
|
For the Gaussian, I think that error figure is unfair... note that measure() also calls the Gaussian CDF 2^d times, and each call brings its own error, so the tolerance must account for the error of the QMC CDF plus the error that measure() itself accumulates. As for the t, I think its convergence takes a little longer... a solution could be to add specialized routines for two and three variables right now |
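The 2^d amplification comes from the inclusion–exclusion formula a box measure must evaluate: every corner of the hyperrectangle costs one CDF call, and the per-call QMC errors can add up. A minimal generic sketch (hypothetical helper, not Copulas.jl's actual `measure`):

```julia
# Sketch of why measure() amplifies CDF error: the probability of the box
# (a, b] is an inclusion–exclusion sum over all 2^d corners, each requiring
# one CDF evaluation, so per-call stochastic errors can accumulate.
function box_measure(cdf, a::AbstractVector, b::AbstractVector)
    d = length(a)
    s = 0.0
    for mask in 0:(2^d - 1)
        # Bit i of `mask` selects a[i] (lower corner) instead of b[i].
        corner = [isodd(mask >> (i - 1)) ? a[i] : b[i] for i in 1:d]
        sign = iseven(count_ones(mask)) ? 1.0 : -1.0
        s += sign * cdf(corner)
    end
    s
end

# For the independence copula Π(u) = prod(u), the box measure is its volume.
vol = box_measure(u -> prod(u), [0.2, 0.3], [0.7, 0.9])
```

With an estimated CDF, each of the 2^d terms contributes its own error, which is why the measure() tolerance has to be looser than the single-call CDF tolerance.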
|
|
Hmm, yes, theirs is adaptive, meaning they increase the sample size until they achieve the desired precision... |
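That adaptive scheme can be sketched generically: keep doubling the sample budget until the reported error estimate drops below the tolerance. Here `estimate` is assumed to return a (value, error) pair like the QMC routines discussed in this PR; the helper itself and its defaults are hypothetical.

```julia
# Hypothetical adaptive wrapper (assumption, not the PR's code): double the
# sample budget m until the reported error estimate falls below `tol`,
# stopping at `maxm` to bound the cost.
function adaptive_estimate(estimate; m0::Int=4_000, tol::Float64=1e-4, maxm::Int=1_000_000)
    m = m0
    p, e = estimate(m)
    while e > tol && 2m <= maxm
        m *= 2
        p, e = estimate(m)
    end
    p, e, m   # final value, error estimate, and budget actually used
end

# Usage with a fake estimator whose error shrinks like 1/sqrt(m):
p, e, m = adaptive_estimate(m -> (0.5, 1 / sqrt(m)); tol=1e-2)
```

Because RQMC error often shrinks faster than 1/√m, the loop typically terminates after only a few doublings in practice.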
|
I tweaked the tests a bit, and the default sample size has increased... Generally, I think that should be documented, right? Or maybe put in a docstring? |
|
Yes, everything in this PR needs to be documented ;) The right place to do it is in the …. But most importantly, there are no tests yet for the TCopula's cdf? You should test it in …
|
|
I'll work on the tests... to conclude... |
|
I prepared a test but for |
|
Hey just wanted to tell you @Santymax98 that i have not forgotten about this i just have less time right now, I will come back next week with updates :) |
|
Hi @lrnv , no problem, I'm also working on some things. I hope to finish them soon so I can conclude here with documentation and some tests. |


🚀 Fast Gaussian Copula CDF + Student-t Correlation Extensions
Summary
This PR introduces a new high-performance Monte Carlo estimator for the Gaussian copula CDF based on the GHK (Geweke–Hajivassiliou–Keane) algorithm combined with Sobol quasi-random sequences and Cranley–Patterson random shifts.
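For readers unfamiliar with GHK, the idea is to factor the orthant probability through the Cholesky factor of Σ and sample each coordinate from a truncated normal conditional on the previous ones, averaging the truncation weights. Below is a minimal plain Monte Carlo sketch of that idea only (no Sobol points or Cranley–Patterson shifts, so it is not the PR's estimator): `normcdf` uses the Zelen–Severo approximation and its inverse is a simple bisection, and all names are illustrative.

```julia
using Random, LinearAlgebra

# Zelen–Severo approximation of the standard normal CDF (|error| ≈ 7.5e-8).
function normcdf(x::Float64)
    x < 0 && return 1 - normcdf(-x)
    t = 1 / (1 + 0.2316419 * x)
    poly = t * (0.319381530 + t * (-0.356563782 + t * (1.781477937 +
           t * (-1.821255978 + t * 1.330274429))))
    return 1 - exp(-x^2 / 2) / sqrt(2π) * poly
end

# Simple bisection inverse of normcdf; accurate enough for a sketch.
function norminv(p::Float64)
    lo, hi = -8.0, 8.0
    for _ in 1:60
        mid = (lo + hi) / 2
        normcdf(mid) < p ? (lo = mid) : (hi = mid)
    end
    (lo + hi) / 2
end

# GHK estimator of P(X .<= b) for X ~ N(0, Σ): sequentially draw truncated
# normals along the Cholesky factor and average the truncation weights.
function ghk_orthant(Σ::AbstractMatrix, b::AbstractVector; m::Int=20_000,
                     rng=MersenneTwister(1))
    d = size(Σ, 1)
    L = cholesky(Symmetric(Σ)).L
    y = zeros(d)
    total = 0.0
    for _ in 1:m
        w = 1.0
        for i in 1:d
            μ = i == 1 ? 0.0 : dot(view(L, i, 1:i-1), view(y, 1:i-1))
            e = normcdf((b[i] - μ) / L[i, i])  # conditional mass below b[i]
            w *= e                             # accumulate truncation weight
            y[i] = norminv(rand(rng) * e)      # truncated-normal draw
        end
        total += w
    end
    total / m
end

p1 = ghk_orthant([1.0 0.5; 0.5 1.0], [0.0, 0.0])
```

With Σ = [1 0.5; 0.5 1] and b = [0, 0] the exact orthant probability is 1/4 + arcsin(0.5)/(2π) = 1/3, which the estimator recovers to Monte Carlo accuracy; the PR replaces the plain uniforms with shifted Sobol points to sharpen exactly this kind of estimate.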
The method achieves both high numerical accuracy and massive speedups, outperforming classical GHK implementations (e.g., `mvtnorm::pmvnorm` in R or MvNormalCDF.jl in Julia).

It also adds a fully numeric implementation of ₂F₁(ν,ν;2ν;z) and of Spearman's ρₛ for the t-copula, removing the dependency on HypergeometricFunctions.jl (but it is the maintainer's decision whether to keep this dependency).

Implementation Details
- `_cdf_base` with Sobol quasi–Monte Carlo and antithetic sampling
- Pivoted Cholesky via `RowMaximum()`
- `fast::Bool=true` (default) for efficient evaluation

The implementation supports both:

- `fast=true` → optimized for simulation and fitting
- `fast=false` → adaptive high-precision reference computation

Benchmarks (ρ = 0.5, u = 0.8, 50k samples)
✅ Example Usage
t-Copula Additions

- `_Gauss_hypergeometric`, `_Gauss_Euler`, and `_Gauss2F1_hybrid`
- `rhoS_t(ν, ρ)` (Spearman correlation) via adaptive quadrature
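For intuition, the quadrature-based `rhoS_t` can be cross-checked by simulation: sample the bivariate elliptical t, take the ranks of each margin, and correlate them (Spearman's ρₛ is the Pearson correlation of ranks). Everything in this sketch is illustrative, including the function name and sample size; it is a validation aid, not the PR's adaptive quadrature.

```julia
using Random, LinearAlgebra, Statistics

# Hedged sketch: estimate Spearman's ρₛ of a bivariate t-copula by simulating
# the elliptical t via its scale-mixture representation and rank-correlating.
function spearman_t(ρ::Float64, ν::Int; n::Int=100_000, rng=MersenneTwister(7))
    L = cholesky([1.0 ρ; ρ 1.0]).L
    x = Matrix{Float64}(undef, 2, n)
    for j in 1:n
        z = L * randn(rng, 2)
        w = sum(abs2, randn(rng, ν))     # χ²_ν mixing variable
        x[:, j] = z .* sqrt(ν / w)
    end
    r1 = invperm(sortperm(x[1, :]))      # ranks of the first margin
    r2 = invperm(sortperm(x[2, :]))      # ranks of the second margin
    cor(float.(r1), float.(r2))          # Pearson correlation of ranks = ρₛ
end

rhoS = spearman_t(0.5, 5)
```

For ρ = 0.5 the Gaussian copula gives ρₛ = (6/π)·arcsin(ρ/2) ≈ 0.483, and the t-copula value for moderate ν stays close to that, so the simulated estimate provides a sanity band for the quadrature result.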