Machine Learning and Thin Sections

The project I’m going to talk about here was an attempt to build a data-driven model for rock permeability from the corresponding petrographic variables, such as porosity, mud content, grain fractions and so on. These variables were obtained by applying the so-called point-counting technique to rock thin sections like this one:

[Figure: thin section image, GhawarBCGI]

In the end, each sample was characterized by 28 features.

Permeability of a given rock sample depends on the total amount of pore space, characterized by porosity, as well as on how that pore space is distributed within the rock, which we can express as some kind of probability density function of pore sizes. In essence, what I’m going to investigate is whether the information about the pore size distribution can be sufficiently captured by the bulk petrographic properties extracted from thin sections.

First we load the data (which I cannot share, unfortunately), keep only the permeable samples and discard the dolomite rocks, because their permeability behavior is significantly different from that of limestones and will be studied separately.

import numpy as np
import pandas as pd

data = pd.read_csv('ts.csv', na_values='-')
data.fillna(0.0, inplace=True)
data = data[(data['FACIES'] != 'DOLO') & (data['PERM'] != 0)]

This results in 572 samples with 28 features, of which 5 are categorical and the rest are numeric. Next we look for errors in the data. Errors arise because some features are volumetric fractions that should sum either to 100 percent or to the value of another feature. For instance, we have 11 different grain type features that should ideally sum to 100. Alternatively, grain fractions were not counted at all for some samples and hence should sum to 0.

# GRAINS
grain_component_columns = ['OOID', 'PELOID', 'FORAM', 'MOLL', 'ECHIN', 'CORAL',
                           'STROM', 'DALGAE', 'COATGRN', 'INTRAC', 'OTHERGRN']
grain_sums = data[grain_component_columns].sum(axis=1).values
wrong_grains = np.logical_not(np.logical_or(np.isclose(grain_sums, 100),
                                            np.isclose(grain_sums, 0.0)))
print(f"Number of incorrectly assigned grain fractions: {np.count_nonzero(wrong_grains)}")

# POROSITY
porosity_component_columns = ['Interparticle (BP)', 'Moldic (MO)', 'WP intraparticle',
                              'BC Intercrystalline', 'OTHERPOR']
porosity_sum = data[porosity_component_columns].sum(axis=1).values
porosity_difference = porosity_sum - data['POR'].values
wrong_porosity = np.logical_not(np.isclose(porosity_difference, 0.0, atol=1e-3))
print(f"Number of incorrectly assigned porosity fractions: {np.count_nonzero(wrong_porosity)}")

# CEMENT
cement_component_columns = ['FC', 'MO', 'EC']
cement_sum = data[cement_component_columns].sum(axis=1).values
cement_difference = cement_sum - data['Total Cement'].values
wrong_cement = np.logical_not(np.isclose(cement_difference, 0.0, atol=1e-3))
print(f"Number of incorrectly assigned cement fractions: {np.count_nonzero(wrong_cement)}")

# TOTAL VOLUME
total_volume_components = ['POR', 'Mud and matrix', 'TOTGRN', 'Total Cement']
volume_fractions_sum = data[total_volume_components].sum(axis=1).values
wrong_volume = np.logical_not(np.isclose(volume_fractions_sum, 100))
print(f"Number of incorrectly assigned volume fractions: {np.count_nonzero(wrong_volume)}")
Number of incorrectly assigned grain fractions: 96
Number of incorrectly assigned porosity fractions: 101
Number of incorrectly assigned cement fractions: 1
Number of incorrectly assigned volume fractions: 1

There is a significant number of samples with errors. I decided to discard the samples with errors in the porosity-, cement- or volume-related features, and to drop the grain fraction features altogether as too unreliable, so that the samples whose only errors are in the grain fractions can be kept.
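A minimal sketch of what this cleanup could look like, reusing the masks and column lists computed above; the one-hot encoding of the five categorical columns (names assumed here) is my reconstruction, since the ranking table further below contains dummy variables such as FACIES_BCGI and WELL_32:

# keep only samples whose porosity, cement and total-volume sums check out
clean_mask = ~(wrong_porosity | wrong_cement | wrong_volume)
data_red = data[clean_mask].drop(columns=grain_component_columns)

# one-hot encode the categorical descriptors so every feature is numeric
categorical_columns = ['FACIES', 'TEXTURE', 'FIELD', 'ZONE', 'WELL']  # assumed column names
data_red = pd.get_dummies(data_red, columns=categorical_columns)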

Before moving forward, I log-transformed the permeability values and then standardized all the features to zero mean and unit variance:

from sklearn.preprocessing import StandardScaler
from math import log10

X = data_red.drop('PERM', axis=1).values
y = data_red['PERM'].apply(lambda x: log10(x)).values

X_std = StandardScaler().fit_transform(X)

After that I performed feature selection using a range of algorithms: Pearson correlation (via the univariate F-test), distance correlation (based on this gist), ranking by elastic net regression coefficients, random forest feature importance, stability selection using randomized lasso, and recursive feature elimination.

from sklearn.ensemble import RandomForestRegressor 
from sklearn.feature_selection import f_regression
from sklearn.linear_model import ElasticNet

cols = data_red.drop('PERM', axis=1).columns
feature_rankings = pd.DataFrame(index=cols)

# univariate feature selection: F-test and distance correlation
def dist(x, y):
    # pairwise absolute differences between all elements of x and y
    return np.abs(x[:, np.newaxis] - y)

def d_n(x):
    # doubly-centred pairwise distance matrix
    d = dist(x, x)
    dn = d - d.mean(0) - d.mean(1)[:, np.newaxis] + d.mean()
    return dn

def dcov_all(x, y):
    # distance covariance / correlation statistics, following the referenced gist
    dnx = d_n(x)
    dny = d_n(y)

    denom = np.prod(dnx.shape)
    dc = (dnx * dny).sum() / denom
    dvx = (dnx**2).sum() / denom
    dvy = (dny**2).sum() / denom
    dr = dc / (np.sqrt(dvx) * np.sqrt(dvy))
    return dc, dr, dvx, dvy

f_test, _ = f_regression(X_std, y)
f_test /= np.max(f_test)

dc_scores = []
for i in range(X_std.shape[1]):
    dc, _, _, _ = dcov_all(X_std[:, i], y)
    dc_scores.append(dc)
dc_scores = np.array(dc_scores)
dc_scores /= np.max(dc_scores)

feature_rankings['Pearson'] = f_test
feature_rankings['DC'] = dc_scores

# linear models with regularization
elastic_net = ElasticNet(alpha=0.7, l1_ratio=0.1)

elastic_net.fit(X_std, y)
feature_rankings['ElasticNet'] = elastic_net.coef_

# random forest feature importance
rf = RandomForestRegressor(random_state=0)
rf.fit(X_std, y)
feature_rankings['RF'] = rf.feature_importances_ / np.max(rf.feature_importances_)

# stability selection (note: RandomizedLasso was removed in scikit-learn 0.21,
# so this step requires an older scikit-learn version)
from sklearn.linear_model import RandomizedLasso
rlasso = RandomizedLasso(alpha=0.005, n_jobs=-1)
rlasso.fit(X_std, y)
feature_rankings['Stability'] = rlasso.scores_

# recursive feature elimination
from sklearn.feature_selection import RFE
selector = RFE(elastic_net, n_features_to_select=1)
selector.fit(X_std, y)
feature_rankings['RFE'] = selector.ranking_
print(feature_rankings)
Pearson DC ElasticNet RF Stability RFE
DEPTH 0.012409 0.043601 -0.000000 1.775766e-02 0.000 52
POR 1.000000 1.000000 0.300875 1.000000e+00 1.000 1
Mud and matrix 0.979700 0.944287 -0.232541 2.740140e-01 0.695 2
TOTGRN 0.329289 0.512823 0.125154 3.748489e-02 0.320 3
DoLo 0.160897 0.348428 -0.114194 2.268208e-02 0.805 4
Total Cement 0.072758 0.182865 0.000000 7.463309e-03 0.035 21
FC 0.091295 0.186574 0.000000 5.180301e-03 0.020 20
MO 0.066512 0.169349 0.000000 1.397019e-03 0.000 19
EC 0.000186 0.004129 0.000000 4.406653e-03 0.020 22
Interparticle (BP) 0.437091 0.652144 0.150489 2.734478e-02 0.290 6
Moldic (MO) 0.022927 0.065443 0.071612 1.832028e-02 0.270 11
WP intraparticle 0.021676 0.051974 0.000000 1.573746e-02 0.000 28
BC Intercrystalline 0.007280 0.014063 -0.000000 1.983509e-04 0.000 34
OTHERPOR 0.000575 0.002267 0.000000 6.326667e-03 0.100 35
Anhy 0.002769 0.002500 0.000000 8.223801e-04 0.005 23
FACIES_BCGI 0.001095 0.018077 0.020526 1.116013e-02 0.420 15
FACIES_CLADO 0.011248 0.017371 0.000000 3.931908e-04 0.000 33
FACIES_MIC 0.355562 0.596872 -0.104562 9.380061e-03 0.535 5
FACIES_SO 0.028381 0.086697 -0.000000 8.336911e-03 0.270 36
FACIES_SRAC 0.046250 0.107694 0.082938 4.219528e-03 0.615 7
TEXTURE_B 0.003209 0.002197 0.000000 1.573111e-07 0.000 40
TEXTURE_G 0.114598 0.212404 0.112852 6.026094e-03 0.525 8
TEXTURE_M 0.149274 0.175061 -0.083807 2.027910e-03 0.460 10
TEXTURE_MLP 0.181436 0.474996 0.089893 6.848541e-05 0.190 9
TEXTURE_P 0.023254 0.261818 -0.009810 1.449022e-02 0.070 17
TEXTURE_W 0.116260 0.226927 -0.042081 3.227577e-03 0.060 12
FIELD_ANDR 0.008156 0.006036 0.000000 7.963694e-05 0.000 51
FIELD_HRDH 0.001377 0.006590 0.000000 2.866362e-03 0.085 30
FIELD_HWYH 0.000298 0.001514 -0.000000 1.839563e-04 0.000 29
FIELD_SDGM 0.000316 0.008238 0.000000 2.149841e-03 0.000 27
FIELD_UTMN 0.004103 0.017334 -0.000000 1.365817e-03 0.110 25
ZONE_1 0.001849 0.002020 -0.000000 3.129510e-03 0.000 31
ZONE_2A 0.094149 0.238360 0.000000 1.640882e-03 0.000 32
ZONE_2B 0.023751 0.083241 0.026671 1.543156e-03 0.120 13
ZONE_3A 0.055387 0.167929 -0.008095 1.245921e-03 0.035 18
ZONE_3B 0.035680 0.073229 -0.000000 1.415927e-03 0.000 26
ZONE_4 0.043935 0.042175 -0.025119 1.454697e-03 0.105 14
WELL_32 0.003606 0.008269 -0.000000 9.772440e-03 0.030 24
WELL_45 0.008116 0.015835 0.016670 3.000428e-03 0.365 16
WELL_46 0.005061 0.005595 -0.000000 1.730220e-03 0.000 37
WELL_71 0.001412 0.002288 0.000000 3.930716e-04 0.000 38
WELL_98 0.000298 0.001514 -0.000000 1.102958e-03 0.000 39
WELL_193 0.004559 0.008605 0.000000 1.978925e-04 0.000 41
WELL_217 0.008156 0.006036 0.000000 5.853586e-05 0.000 42
WELL_222 0.001755 0.010385 0.000000 5.463477e-04 0.000 43
WELL_527 0.000064 0.004559 0.000000 1.092136e-03 0.000 44
WELL_530 0.000370 0.002327 -0.000000 5.450152e-04 0.000 45
WELL_547 0.000380 0.007311 -0.000000 2.273051e-03 0.000 46
WELL_586 0.000012 0.001414 0.000000 9.969440e-04 0.000 47
WELL_602 0.009914 0.024923 -0.000000 8.919203e-04 0.000 48
WELL_603 0.004082 0.006328 -0.000000 5.021024e-07 0.010 49
WELL_660 0.000223 0.006313 -0.000000 5.037259e-05 0.000 50

Based on these results I discarded the data on sample location: field, geologic zone, well and depth. I also excluded the anhydrite fraction as irrelevant.
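A sketch of this pruning step, reusing the dummy-column prefixes from the ranking table above; rebuilding X_std simply repeats the earlier preprocessing on the reduced feature set:

# drop sample-location descriptors and the anhydrite fraction
location_columns = [c for c in data_red.columns
                    if c.startswith(('FIELD_', 'ZONE_', 'WELL_'))]
data_red = data_red.drop(columns=location_columns + ['DEPTH', 'Anhy'])

# rebuild the standardized feature matrix and the log-permeability target
X = data_red.drop('PERM', axis=1).values
y = data_red['PERM'].apply(log10).values
X_std = StandardScaler().fit_transform(X)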

Once I was happy with the feature set, I moved on to nested cross-validation to pick the most appropriate model, but all the regression algorithms I tested returned very similar R2 scores. I therefore decided to examine the learning curves of different models. I began with only porosity and mud as input features (the original hypothesis offered by my colleague) and applied kernel ridge and gradient boosting regression models:

[Figures: learning curves for kernel ridge (krr_lc_porosity_mud) and gradient boosting (gbr_lc_porosity_mud)]
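For reference, here is a minimal sketch of how such learning curves can be produced with scikit-learn's learning_curve; the kernel ridge hyperparameters below are placeholders, not necessarily the ones used for the plots:

import matplotlib.pyplot as plt
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import learning_curve

# porosity and mud only, standardized as before
feature_cols = ['POR', 'Mud and matrix']
X_pm = StandardScaler().fit_transform(data_red[feature_cols].values)

krr = KernelRidge(kernel='rbf', alpha=1.0, gamma=0.1)  # placeholder hyperparameters
sizes, train_scores, valid_scores = learning_curve(
    krr, X_pm, y, cv=5, scoring='r2', train_sizes=np.linspace(0.1, 1.0, 10))

plt.plot(sizes, train_scores.mean(axis=1), label='training R2')
plt.plot(sizes, valid_scores.mean(axis=1), label='cross-validation R2')
plt.xlabel('Number of training samples')
plt.ylabel('R2')
plt.legend()
plt.show()

The gradient boosting curves are obtained the same way by swapping in GradientBoostingRegressor.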

As one might expect, the tree-based method clearly overfits the data, while the quick plateauing of the training accuracy of kernel ridge regression indicates high bias, so we need to add more features. Adding all major volumetric components (porosity, mud, grain and cement total fractions) does not alter the result, but adding the dolomite fraction definitely improves the prediction:

[Figures: learning curves with the dolomite fraction added, kernel ridge (krr_lc_porosity_mud_dolo) and gradient boosting (gbr_lc_porosity_mud)]

Any additional features simply increase the variance without improving the accuracy, with the full feature set being the most extreme case:

[Figures: learning curves with all features, kernel ridge (krr_lc_all) and gradient boosting (gbr_lc_all)]

Finally, here is the residuals plot for kernel ridge regression with the total volume fractions and the dolomite fraction as features:

[Figure: residuals plot]
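A sketch of how such a residuals plot can be obtained from out-of-fold predictions; the feature list follows the description above, while the hyperparameters are again placeholders:

import matplotlib.pyplot as plt
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_predict

# total volume fractions plus the dolomite fraction
feature_cols = ['POR', 'Mud and matrix', 'TOTGRN', 'Total Cement', 'DoLo']
X_vol = StandardScaler().fit_transform(data_red[feature_cols].values)

krr = KernelRidge(kernel='rbf', alpha=1.0, gamma=0.1)  # placeholder hyperparameters
y_pred = cross_val_predict(krr, X_vol, y, cv=5)
residuals = y - y_pred  # residuals in log10(permeability)

plt.scatter(y_pred, residuals, s=10)
plt.axhline(0.0, color='k', lw=1)
plt.xlabel('Predicted log10(permeability)')
plt.ylabel('Residual')
plt.show()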

The residual distribution is definitely not random: we can see patterns and clustering at high permeability values. The resulting prediction errors are about an order of magnitude. So, in conclusion, I would say that the measured petrographic variables are definitely not sufficient to explain the permeability behavior of the studied carbonate rocks. In hindsight, this is not very surprising, because volumetric fractions carry no information on how the pore sizes are distributed, which is a key control on permeability.
