Obtaining the empirical CDF of a given dataset

by Blade   Last Updated August 13, 2019 22:19 PM

I have a dataset for a variable $x$ whose values lie between 0 and 6, and I would like a function that gives the empirical CDF of $x$. Since $x$ does not follow a known distribution (Gaussian, etc.), the function has to be built from the data values themselves. Using the following code, I can plot the empirical CDF:

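import matplotlib.pyplot as plt
import seaborn as sns

max_diam = 6
# Cumulative histogram with a cumulative KDE curve, then a cumulative KDE with a fixed bandwidth
sns.distplot(x, hist_kws=dict(cumulative=True), kde_kws=dict(cumulative=True)).set(xlim=(0, max_diam))
sns.kdeplot(x, bw=.1, cumulative=True).set(xlim=(0, max_diam), ylim=(0, 1.0))
plt.show()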

[Image: plot produced by the code above]
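For reference, the empirical CDF can also be written as a callable step function rather than only plotted. The following is a minimal sketch, not a definitive implementation: the name `make_ecdf` and the small `sample` array are hypothetical and stand in for the real data.

```python
import numpy as np

def make_ecdf(data):
    """Return a callable F where F(t) is the fraction of samples <= t."""
    sorted_data = np.sort(np.asarray(data, dtype=float))
    n = sorted_data.size

    def F(t):
        # side="right" counts every sample that is less than or equal to t
        return np.searchsorted(sorted_data, t, side="right") / n

    return F

# Hypothetical sample standing in for the real data
sample = np.array([0.5, 1.0, 1.0, 2.5, 6.0])
F = make_ecdf(sample)
print(F(1.0))                          # 3 of the 5 samples are <= 1.0
print(F(np.array([0.0, 3.0, 6.0])))    # also works on arrays of query points
```

The result is a step function, so it is exact for the data but not smooth.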

Now I would like to find the function that kdeplot uses to plot the CDF. I have tried polynomial regression, but the fit is poor: there is only a single data point above 4.9 (at 6.0), so the curve overfits at high polynomial degrees and underfits at low ones.

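import numpy as np
import matplotlib.pyplot as plt
import sklearn.linear_model as skl  # assumed alias, matching skl.LinearRegression below
from sklearn.preprocessing import PolynomialFeatures

def ecdf(data):
    # Return the sorted sample and the ECDF value at each sorted point
    x = np.sort(data)
    n = x.size
    y = np.arange(1, n + 1) / n
    return x, y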

x, y = ecdf(x)
degree=7
lw = 2

plt.scatter(x=x, y=y, s=10);
plt.xlabel('x', fontsize=16)
plt.ylabel('y', fontsize=16)

polynomial_features= PolynomialFeatures(degree)
x_poly = polynomial_features.fit_transform(x.reshape(-1, 1))

model = skl.LinearRegression()
model.fit(x_poly, y)

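# x_plot was not defined in the original snippet; an evenly spaced grid is assumed here
x_plot = np.linspace(0, max_diam, 200)
x_test = polynomial_features.transform(x_plot.reshape(-1, 1))  # reuse the fitted transformer
y_test = model.predict(x_test)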
plt.plot(x_plot, y_test, color='yellowgreen', linewidth=lw, label="degree %d" % degree)

plt.show()

[Image: plot produced by the code above]
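For comparison with the global polynomial fit, one possible alternative is a shape-preserving interpolant through the ECDF points. The sketch below uses SciPy's `PchipInterpolator`, which stays monotone when the input points are monotone. This is a swapped-in technique, not what the code above does, and the name `make_monotone_cdf` and the `sample` array are hypothetical.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def make_monotone_cdf(data):
    """Return a smooth, non-decreasing curve through the ECDF points."""
    # Collapse repeated values: the interpolator needs strictly increasing x
    xs, counts = np.unique(np.asarray(data, dtype=float), return_counts=True)
    ys = np.cumsum(counts) / counts.sum()
    interp = PchipInterpolator(xs, ys)

    def F(t):
        # Clamp queries to the observed range so the curve is flat outside it
        # (below the smallest sample it returns ys[0], not 0)
        return interp(np.clip(t, xs[0], xs[-1]))

    return F

# Hypothetical sample standing in for the real data
sample = np.array([0.5, 1.0, 1.0, 2.5, 6.0])
F_mono = make_monotone_cdf(sample)
grid = np.linspace(0, 6, 13)
print(F_mono(grid))
```

Because the interpolant is piecewise and local, an isolated point such as the one at 6.0 only affects the last segment instead of bending the whole curve.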

So my questions are:

  1. Is there a way to get the function that kdeplot is using for plotting the orange line?
  2. Is there a way to have a better regression that is accurate and does not over/underfit?
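To make the first question concrete: the CDF of a Gaussian kernel density estimate is the average of normal CDFs centred on the samples, $F(t) = \frac{1}{n}\sum_i \Phi\left(\frac{t - x_i}{h}\right)$. The sketch below implements that formula directly. It assumes `h` is the kernel standard deviation in data units; how seaborn interprets `bw=.1` may depend on its version and backend, so an exact match with the plotted curve is not guaranteed. The name `make_kde_cdf` and the `sample` array are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def make_kde_cdf(data, h):
    """Return F(t) = mean over samples of Phi((t - x_i) / h)."""
    samples = np.asarray(data, dtype=float)

    def F(t):
        t = np.atleast_1d(np.asarray(t, dtype=float))
        # Broadcast to shape (len(t), len(samples)), then average over samples
        return norm.cdf((t[:, None] - samples[None, :]) / h).mean(axis=1)

    return F

# Hypothetical sample standing in for the real data; h is an assumed bandwidth
sample = np.array([0.5, 1.0, 1.0, 2.5, 6.0])
F_smooth = make_kde_cdf(sample, h=0.1)
grid = np.linspace(0, 6, 7)
print(F_smooth(grid))
```

A function built this way is smooth and non-decreasing by construction, which sidesteps the over/underfitting trade-off of a polynomial fit.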

