The project I’m going to talk about here was an attempt to build a data-driven model for rock permeability from the corresponding petrographic variables, such as porosity, mud content, grain fractions and so on. These variables were obtained from applying the so-called point counting technique to rock thin sections like this one:

In the end, each sample was characterized by 28 features.

Permeability of a given rock sample depends on the total amount of pore space, characterized by porosity, as well as on how the pore space is distributed within the rock, which we can express as some kind of probability density function of pore sizes. In essence, what I’m going to investigate is whether the information about the pore size distribution can be sufficiently captured by the bulk petrographic properties extracted from thin sections.
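As background (this relation plays no role in the model below), the classical Kozeny–Carman equation illustrates why porosity alone is not enough: permeability also depends strongly on pore geometry through the specific surface area:

```latex
% Kozeny-Carman: k is permeability, \phi porosity, S the specific surface
% area of the pore space, and c an empirical (Kozeny) constant.
k = \frac{\phi^{3}}{c\, S^{2}\, (1-\phi)^{2}}
```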

First we load the data (which, unfortunately, I cannot share), keep only the permeable samples and discard the dolomite rocks, because their permeability behavior is significantly different from that of limestones and will be studied separately.

```python
import numpy as np
import pandas as pd

data = pd.read_csv('ts.csv', na_values='-')
data.fillna(0.0, inplace=True)
data = data[(data['FACIES'] != 'DOLO') & (data['PERM'] != 0)]
```

This results in 572 samples with 28 features, of which 5 are categorical and the rest are numeric. Next we look for errors in the data. Errors arise because some features are volumetric fractions that should sum either to 100 percent or to the value of another feature. For instance, we have 11 different grain type features that should ideally sum to 100. Alternatively, grain fractions were not counted at all for some samples and hence should sum to 0.

```python
# GRAINS
grain_component_columns = ['OOID', 'PELOID', 'FORAM', 'MOLL', 'ECHIN', 'CORAL',
                           'STROM', 'DALGAE', 'COATGRN', 'INTRAC', 'OTHERGRN']
grain_sums = data[grain_component_columns].sum(axis=1).values
wrong_grains = np.logical_not(np.logical_or(np.isclose(grain_sums, 100),
                                            np.isclose(grain_sums, 0.0)))
print(f"Number of incorrectly assigned grain fractions: {np.count_nonzero(wrong_grains)}")

# POROSITY
porosity_component_columns = ['Interparticle (BP)', 'Moldic (MO)', 'WP intraparticle',
                              'BC Intercrystalline', 'OTHERPOR']
porosity_sum = data[porosity_component_columns].sum(axis=1).values
porosity_difference = porosity_sum - data['POR'].values
wrong_porosity = np.logical_not(np.isclose(porosity_difference, 0.0, atol=1e-3))
print(f"Number of incorrectly assigned porosity fractions: {np.count_nonzero(wrong_porosity)}")

# CEMENT
cement_component_columns = ['FC', 'MO', 'EC']
cement_sum = data[cement_component_columns].sum(axis=1).values
cement_difference = cement_sum - data['Total Cement'].values
wrong_cement = np.logical_not(np.isclose(cement_difference, 0.0, atol=1e-3))
print(f"Number of incorrectly assigned cement fractions: {np.count_nonzero(wrong_cement)}")

# TOTAL VOLUME
total_volume_components = ['POR', 'Mud and matrix', 'TOTGRN', 'Total Cement']
volume_fractions_sum = data[total_volume_components].sum(axis=1).values
wrong_volume = np.logical_not(np.isclose(volume_fractions_sum, 100))
print(f"Number of incorrectly assigned volume fractions: {np.count_nonzero(wrong_volume)}")
```

```
Number of incorrectly assigned grain fractions: 96
Number of incorrectly assigned porosity fractions: 101
Number of incorrectly assigned cement fractions: 1
Number of incorrectly assigned volume fractions: 1
```

There is a significant number of samples with errors. I decided to discard the samples with errors in the porosity-, cement- or volume-related features, and to drop the grain fraction features altogether as too unreliable, which let me keep the samples whose only errors were in the grain counts.
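For the snippets that follow, here is a minimal sketch of that filtering step, reusing the boolean masks from the checks above; the one-hot encoding of the categorical features is my assumption, since dummy columns such as FACIES_MIC show up in the rankings below:

```python
# Sketch (assumed bookkeeping): keep samples that pass the porosity, cement
# and volume checks, drop the unreliable grain-fraction columns, and one-hot
# encode the categorical features.
keep_mask = ~(wrong_porosity | wrong_cement | wrong_volume)
data_red = data.loc[keep_mask].drop(columns=grain_component_columns)
# get_dummies only touches object/categorical columns; if some categoricals
# are stored as numbers (e.g. well IDs), list them explicitly via `columns=`.
data_red = pd.get_dummies(data_red)
```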

Before moving forward, I log-transformed the permeability values and normalized all features to zero mean and unit variance:

```python
from math import log10

from sklearn.preprocessing import StandardScaler

X = data_red.drop('PERM', axis=1).values
y = data_red['PERM'].apply(lambda x: log10(x)).values
X_std = StandardScaler().fit_transform(X)
```

After that I ranked the features with a range of selection algorithms: Pearson correlation, distance correlation (based on this gist), elastic net regression coefficients, random forest importance, stability selection using randomized lasso, and recursive feature elimination.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE, f_regression
from sklearn.linear_model import ElasticNet, RandomizedLasso

cols = data_red.drop('PERM', axis=1).columns
feature_rankings = pd.DataFrame(index=cols)

# univariate feature selection
def dist(x, y):
    return np.abs(x[:, np.newaxis] - y)

def d_n(x):
    d = dist(x, x)
    dn = d - d.mean(0) - d.mean(1)[:, np.newaxis] + d.mean()
    return dn

def dcov_all(x, y):
    # distance covariance/correlation between two 1D samples
    dnx = d_n(x)
    dny = d_n(y)
    denom = np.prod(dnx.shape)
    dc = (dnx * dny).sum() / denom
    dvx = (dnx**2).sum() / denom
    dvy = (dny**2).sum() / denom
    dr = dc / (np.sqrt(dvx) * np.sqrt(dvy))
    return dc, dr, dvx, dvy

f_test, _ = f_regression(X_std, y)
f_test /= np.max(f_test)

dc_scores = []
for i in range(X_std.shape[1]):
    dc, _, _, _ = dcov_all(X_std[:, i], y)
    dc_scores.append(dc)
dc_scores = np.array(dc_scores)
dc_scores /= np.max(dc_scores)

feature_rankings['Pearson'] = f_test
feature_rankings['DC'] = dc_scores

# linear model with regularization
elastic_net = ElasticNet(alpha=0.7, l1_ratio=0.1)
elastic_net.fit(X_std, y)
feature_rankings['ElasticNet'] = elastic_net.coef_

# random forest feature importance
rf = RandomForestRegressor(random_state=0)
rf.fit(X_std, y)
feature_rankings['RF'] = rf.feature_importances_ / np.max(rf.feature_importances_)

# stability selection
rlasso = RandomizedLasso(alpha=0.005, n_jobs=-1)
rlasso.fit(X_std, y)
feature_rankings['Stability'] = rlasso.scores_

# recursive feature elimination
selector = RFE(elastic_net, n_features_to_select=1)
selector.fit(X_std, y)
feature_rankings['RFE'] = selector.ranking_

print(feature_rankings)
```

| Feature | Pearson | DC | ElasticNet | RF | Stability | RFE |
|---|---|---|---|---|---|---|
| DEPTH | 0.012409 | 0.043601 | -0.000000 | 1.775766e-02 | 0.000 | 52 |
| POR | 1.000000 | 1.000000 | 0.300875 | 1.000000e+00 | 1.000 | 1 |
| Mud and matrix | 0.979700 | 0.944287 | -0.232541 | 2.740140e-01 | 0.695 | 2 |
| TOTGRN | 0.329289 | 0.512823 | 0.125154 | 3.748489e-02 | 0.320 | 3 |
| DoLo | 0.160897 | 0.348428 | -0.114194 | 2.268208e-02 | 0.805 | 4 |
| Total Cement | 0.072758 | 0.182865 | 0.000000 | 7.463309e-03 | 0.035 | 21 |
| FC | 0.091295 | 0.186574 | 0.000000 | 5.180301e-03 | 0.020 | 20 |
| MO | 0.066512 | 0.169349 | 0.000000 | 1.397019e-03 | 0.000 | 19 |
| EC | 0.000186 | 0.004129 | 0.000000 | 4.406653e-03 | 0.020 | 22 |
| Interparticle (BP) | 0.437091 | 0.652144 | 0.150489 | 2.734478e-02 | 0.290 | 6 |
| Moldic (MO) | 0.022927 | 0.065443 | 0.071612 | 1.832028e-02 | 0.270 | 11 |
| WP intraparticle | 0.021676 | 0.051974 | 0.000000 | 1.573746e-02 | 0.000 | 28 |
| BC Intercrystalline | 0.007280 | 0.014063 | -0.000000 | 1.983509e-04 | 0.000 | 34 |
| OTHERPOR | 0.000575 | 0.002267 | 0.000000 | 6.326667e-03 | 0.100 | 35 |
| Anhy | 0.002769 | 0.002500 | 0.000000 | 8.223801e-04 | 0.005 | 23 |
| FACIES_BCGI | 0.001095 | 0.018077 | 0.020526 | 1.116013e-02 | 0.420 | 15 |
| FACIES_CLADO | 0.011248 | 0.017371 | 0.000000 | 3.931908e-04 | 0.000 | 33 |
| FACIES_MIC | 0.355562 | 0.596872 | -0.104562 | 9.380061e-03 | 0.535 | 5 |
| FACIES_SO | 0.028381 | 0.086697 | -0.000000 | 8.336911e-03 | 0.270 | 36 |
| FACIES_SRAC | 0.046250 | 0.107694 | 0.082938 | 4.219528e-03 | 0.615 | 7 |
| TEXTURE_B | 0.003209 | 0.002197 | 0.000000 | 1.573111e-07 | 0.000 | 40 |
| TEXTURE_G | 0.114598 | 0.212404 | 0.112852 | 6.026094e-03 | 0.525 | 8 |
| TEXTURE_M | 0.149274 | 0.175061 | -0.083807 | 2.027910e-03 | 0.460 | 10 |
| TEXTURE_MLP | 0.181436 | 0.474996 | 0.089893 | 6.848541e-05 | 0.190 | 9 |
| TEXTURE_P | 0.023254 | 0.261818 | -0.009810 | 1.449022e-02 | 0.070 | 17 |
| TEXTURE_W | 0.116260 | 0.226927 | -0.042081 | 3.227577e-03 | 0.060 | 12 |
| FIELD_ANDR | 0.008156 | 0.006036 | 0.000000 | 7.963694e-05 | 0.000 | 51 |
| FIELD_HRDH | 0.001377 | 0.006590 | 0.000000 | 2.866362e-03 | 0.085 | 30 |
| FIELD_HWYH | 0.000298 | 0.001514 | -0.000000 | 1.839563e-04 | 0.000 | 29 |
| FIELD_SDGM | 0.000316 | 0.008238 | 0.000000 | 2.149841e-03 | 0.000 | 27 |
| FIELD_UTMN | 0.004103 | 0.017334 | -0.000000 | 1.365817e-03 | 0.110 | 25 |
| ZONE_1 | 0.001849 | 0.002020 | -0.000000 | 3.129510e-03 | 0.000 | 31 |
| ZONE_2A | 0.094149 | 0.238360 | 0.000000 | 1.640882e-03 | 0.000 | 32 |
| ZONE_2B | 0.023751 | 0.083241 | 0.026671 | 1.543156e-03 | 0.120 | 13 |
| ZONE_3A | 0.055387 | 0.167929 | -0.008095 | 1.245921e-03 | 0.035 | 18 |
| ZONE_3B | 0.035680 | 0.073229 | -0.000000 | 1.415927e-03 | 0.000 | 26 |
| ZONE_4 | 0.043935 | 0.042175 | -0.025119 | 1.454697e-03 | 0.105 | 14 |
| WELL_32 | 0.003606 | 0.008269 | -0.000000 | 9.772440e-03 | 0.030 | 24 |
| WELL_45 | 0.008116 | 0.015835 | 0.016670 | 3.000428e-03 | 0.365 | 16 |
| WELL_46 | 0.005061 | 0.005595 | -0.000000 | 1.730220e-03 | 0.000 | 37 |
| WELL_71 | 0.001412 | 0.002288 | 0.000000 | 3.930716e-04 | 0.000 | 38 |
| WELL_98 | 0.000298 | 0.001514 | -0.000000 | 1.102958e-03 | 0.000 | 39 |
| WELL_193 | 0.004559 | 0.008605 | 0.000000 | 1.978925e-04 | 0.000 | 41 |
| WELL_217 | 0.008156 | 0.006036 | 0.000000 | 5.853586e-05 | 0.000 | 42 |
| WELL_222 | 0.001755 | 0.010385 | 0.000000 | 5.463477e-04 | 0.000 | 43 |
| WELL_527 | 0.000064 | 0.004559 | 0.000000 | 1.092136e-03 | 0.000 | 44 |
| WELL_530 | 0.000370 | 0.002327 | -0.000000 | 5.450152e-04 | 0.000 | 45 |
| WELL_547 | 0.000380 | 0.007311 | -0.000000 | 2.273051e-03 | 0.000 | 46 |
| WELL_586 | 0.000012 | 0.001414 | 0.000000 | 9.969440e-04 | 0.000 | 47 |
| WELL_602 | 0.009914 | 0.024923 | -0.000000 | 8.919203e-04 | 0.000 | 48 |
| WELL_603 | 0.004082 | 0.006328 | -0.000000 | 5.021024e-07 | 0.010 | 49 |
| WELL_660 | 0.000223 | 0.006313 | -0.000000 | 5.037259e-05 | 0.000 | 50 |

Based on these results I discarded the data on sample location (field, geologic zone, well and depth) and also excluded the anhydrite fraction as irrelevant.
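A sketch of the corresponding pruning, with column names taken from the table above:

```python
# Drop the location-related columns (field, zone, well, depth) and anhydrite.
drop_cols = ['DEPTH', 'Anhy'] + [c for c in data_red.columns
                                 if c.startswith(('FIELD_', 'ZONE_', 'WELL_'))]
data_red = data_red.drop(columns=drop_cols)
```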

After I was happy with the feature set, I moved on to nested cross-validation to pick the most appropriate model for training, but all the regression algorithms I tested returned very similar R² scores. I therefore decided to examine the learning curves of different models. I began with only porosity and mud as input features (the original hypothesis offered by my colleague) and applied kernel ridge and gradient boosting regression models; the resulting learning curves are shown below.
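The curves were generated roughly along these lines; treat this as a sketch, since the hyperparameters and cross-validation setup here are placeholders rather than the original settings:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import learning_curve

# Porosity and mud only, per the initial hypothesis; scaled as before.
X_pm = StandardScaler().fit_transform(data_red[['POR', 'Mud and matrix']].values)

models = {'Kernel ridge': KernelRidge(kernel='rbf', alpha=1.0),  # assumed params
          'Gradient boosting': GradientBoostingRegressor(random_state=0)}
for name, model in models.items():
    sizes, train_sc, valid_sc = learning_curve(
        model, X_pm, y, cv=5, scoring='r2',
        train_sizes=np.linspace(0.1, 1.0, 10))
    print(name, train_sc.mean(axis=1), valid_sc.mean(axis=1))
```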

As one might expect, the tree-based method definitely overfits the data, while the quick plateauing of the training score of kernel ridge regression indicates high bias, so we need to add more features. Adding all major volumetric components (total fractions of porosity, mud, grains and cement) does not alter the result, but adding the dolomite fraction definitely improves the prediction:

Any additional features simply increase the variance without improving the accuracy, with the full feature set being the most extreme case:

Finally, let's examine the residuals of the kernel ridge regression with the total volume fractions and dolomite fraction as features.
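The residuals plotted below come from out-of-fold predictions; a minimal sketch of the computation, again with placeholder hyperparameters:

```python
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_predict

features = ['POR', 'Mud and matrix', 'TOTGRN', 'Total Cement', 'DoLo']
X_final = StandardScaler().fit_transform(data_red[features].values)
y_pred = cross_val_predict(KernelRidge(kernel='rbf', alpha=1.0), X_final, y, cv=5)
residuals = y - y_pred  # residuals in log10(permeability) units
```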

The residual distribution is clearly not random: we can see patterns and clustering at high permeability values, and the resulting prediction errors are about an order of magnitude. In conclusion, I would say that the measured petrographic variables are definitely not sufficient to explain the permeability behavior of the studied carbonate rocks. In hindsight, this is not very surprising, because volumetric fractions carry no information on how the pore sizes are distributed, which is a key control on permeability.