# Obtaining the empirical CDF of a given dataset

by Blade   Last Updated August 13, 2019 22:19 PM

I have a dataset of a variable $$x$$ that takes values between 0 and 6. I would like a function that gives the empirical CDF of $$x$$. Since $$x$$ does not follow a specific distribution (such as a Gaussian), I need to rely on the data values to build this function. Using the following code, I can plot the empirical CDF:

``````
import matplotlib.pyplot as plt
import seaborn as sns

max_diam = 6
# distplot and the bw argument belong to the seaborn API of the time (2019)
sns.distplot(x, hist_kws=dict(cumulative=True), kde_kws=dict(cumulative=True)).set(xlim=(0, max_diam))
sns.kdeplot(x, bw=.1, cumulative=True).set(xlim=(0, max_diam), ylim=(0, 1.0))
plt.show()
``````
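For reference, the empirical CDF itself can be written as a callable step function, $$F_n(t) = \frac{1}{n}\,\#\{x_i \le t\}$$, rather than only as a plot. The following is a minimal sketch; the name `make_ecdf` and the sample values are hypothetical and stand in for the real data.

```python
import numpy as np

def make_ecdf(data):
    """Return a callable F where F(t) is the fraction of samples <= t."""
    sorted_data = np.sort(np.asarray(data, dtype=float))
    n = sorted_data.size

    def F(t):
        # side="right" counts how many samples are less than or equal to t
        return np.searchsorted(sorted_data, t, side="right") / n

    return F

# Hypothetical sample standing in for the real measurements
sample = np.array([0.5, 1.0, 1.0, 2.5, 6.0])
F = make_ecdf(sample)

print(F(1.0))                          # scalar input
print(F(np.array([0.0, 3.0, 6.0])))    # array input
```

This function is exact for the data but piecewise constant, so it jumps at every observation instead of giving a smooth curve.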

Now I would like to find the function that `kdeplot` uses to plot the CDF. I have tried polynomial regression, but the quality is poor: there is only a single data point above 4.9 (at 6.0), so the fit overfits at high polynomial degrees and underfits at low ones.

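``````
import numpy as np
import matplotlib.pyplot as plt
import sklearn.linear_model as skl  # assumed meaning of the skl alias
from sklearn.preprocessing import PolynomialFeatures

def ecdf(data):
    # Compute the ECDF: sorted values and their cumulative proportions
    x = np.sort(data)
    n = x.size
    y = np.arange(1, n + 1) / n
    return x, y

x, y = ecdf(x)
degree = 7
lw = 2

plt.scatter(x=x, y=y, s=10)
plt.xlabel('x', fontsize=16)
plt.ylabel('y', fontsize=16)

# Expand x into polynomial terms up to the chosen degree
polynomial_features = PolynomialFeatures(degree)
x_poly = polynomial_features.fit_transform(x.reshape(-1, 1))

model = skl.LinearRegression()
model.fit(x_poly, y)

# x_plot was not defined in the original snippet; an evenly spaced grid is assumed
x_plot = np.linspace(x.min(), x.max(), 200)
# Reuse the fitted transformer (transform, not fit_transform) on the grid
x_test = polynomial_features.transform(x_plot.reshape(-1, 1))
y_test = model.predict(x_test)
plt.plot(x_plot, y_test, color='yellowgreen', linewidth=lw, label="degree %d" % degree)

plt.show()
``````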

So my questions are:

1. Is there a way to get the function that `kdeplot` is using for plotting the orange line?
2. Is there a way to have a better regression that is accurate and does not over/underfit?
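
As a possible starting point for the first question: the CDF of a Gaussian kernel density estimate has a closed form, the average of normal CDFs centred on the data points, $$\hat{F}(t) = \frac{1}{n}\sum_{i=1}^{n}\Phi\!\left(\frac{t - x_i}{h}\right)$$. The sketch below evaluates that formula directly. It is not taken from seaborn's source, and it assumes that `bw=.1` is used as the kernel standard deviation $$h$$; the name `make_kde_cdf` and the sample values are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def make_kde_cdf(data, bandwidth=0.1):
    """Return a callable smooth CDF: the mean of Gaussian CDFs centred on the samples."""
    samples = np.asarray(data, dtype=float)

    def F(t):
        t = np.atleast_1d(np.asarray(t, dtype=float))
        # Standardised distance from every query point to every sample
        z = (t[:, None] - samples[None, :]) / bandwidth
        # Average the per-sample Gaussian CDFs for each query point
        return norm.cdf(z).mean(axis=1)

    return F

# Hypothetical sample standing in for the real measurements
sample = np.array([0.5, 1.0, 1.0, 2.5, 6.0])
F_smooth = make_kde_cdf(sample, bandwidth=0.1)

grid = np.linspace(0, 6, 7)
print(F_smooth(grid))
```

Because it is an average of CDFs, this function is always non-decreasing and bounded between 0 and 1, which a polynomial regression does not guarantee. If the plotted curve differs from it, the library is probably interpreting the bandwidth differently.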