Question

Consider the following Python code:

import random
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

random.shuffle(active_compounds)
split_index = int(len(active_compounds) * 0.6)  # Use 60% for training
train_set, test_set = active_compounds[:split_index], active_compounds[split_index:]

features = ["logP", "num_hbd", "num_hba", "mw", "num_rotatable_bonds"]
target = "pKi"

X_train = [[molecule[feat] for feat in features] for molecule in train_set]
y_train = [molecule[target] for molecule in train_set]
X_test = [[molecule[feat] for feat in features] for molecule in test_set]
y_test = [molecule[target] for molecule in test_set]

model = LinearRegression()
model.fit(X_train, y_train)

def plot_predictions(model, X, y):
    y_pred = model.predict(X)
    plt.scatter(y, y_pred)
    plt.xlabel("True pKi (uM)")
    plt.ylabel("Predicted pKi (uM)")
    plt.show()
    print("R² score:", model.score(X, y))

print("Training set evaluation:")
plot_predictions(model, X_train, y_train)

print("Testing set evaluation:")
plot_predictions(model, X_test, y_test)

Based on this code, which of the following statements best describes a potential issue or limitation in how the model is trained or evaluated?

BlackTom AI · Accepted Answer

The snippet trains a scikit-learn LinearRegression model on five molecular descriptors (logP, num_hbd, num_hba, mw, num_rotatable_bonds) to predict pKi, then evaluates it on both the training and test sets with a custom plotting function. Each option is evaluated in turn below.

Option 1: The R² score is not a valid metric for evaluating regression models.
- This claim is inaccurate. R² (the coefficient of determination) is a standard regression metric: it measures the fraction of variance in the observed values that the predictions explain, relative to a baseline that always predicts the mean. It has limitations (for example, a high R² does not guarantee that the residuals are free of systematic, nonlinear structure), but it remains valid and widely used. The code calls model.score(X, y), which for scikit-learn's LinearRegression returns R², so there is nothing inherently wrong with using it here. The statement overstates a limitation and is not correct in this context.
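To make the definition concrete, here is a minimal sketch that computes R² by hand as 1 - SS_res / SS_tot. The helper name r2_score_manual and the pKi values are hypothetical, chosen only for illustration; the formula itself is the standard one.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical pKi values, for illustration only
y_true = [5.0, 6.0, 7.0, 8.0]

perfect = r2_score_manual(y_true, y_true)                  # predictions match exactly
mean_only = r2_score_manual(y_true, [6.5] * 4)             # always predict the mean
reversed_ = r2_score_manual(y_true, [8.0, 7.0, 6.0, 5.0])  # worse than the mean

print(perfect, mean_only, reversed_)
```

A perfect predictor scores 1, the mean-only baseline scores 0, and a predictor worse than the baseline scores below 0, which is why R² is read as performance relative to the mean.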

Option 2: The function plot_predictions incorrectly plots the true values on the x-axis and predictions on the y-axis.
- The function calls plt.scatter(y, y_pred), so y (the true values) goes on the x-axis and y_pred (the predictions) on the y-axis. This matches the axis labels: xlabel is "True pKi (uM)" and ylabel is "Predicted pKi (uM)". The option therefore describes exactly what the code does and calls it a mistake, when plotting true values against predictions this way is conventional. (The "(uM)" in the labels is questionable, since pKi is a dimensionless logarithmic quantity, but that is separate from the axis assignment this option is about.) The option identifies a flaw that does not exist in the snippet.

Option 3: The model cannot be used for prediction because it was trained only on a subset of molecular descriptors instead of the full dataset.
- Training on a subset of features does not make prediction impossible: any new molecule that supplies the same five descriptors can be passed to model.predict. The real question is whether the chosen features are informative enough, or whether feature engineering or regularization would help. The code shows a standard training run with an explicit feature list, and nothing in it prevents prediction. This statement mischaracterizes the situation and is not the best description of a limitation here.

Option 4: The model assumes a linear relationship between molecular features and pKi, which may not be valid if the true relationship is nonlinear.
- This is a meaningful and valid limitation. LinearRegression assumes that pKi is a linear function of the input features (logP, num_hbd, num_hba, mw, num_rotatable_bonds). If the underlying structure-activity relationship is nonlinear, or depends on interactions between features, a linear model cannot capture it and its predictions may be systematically biased. Of the five options, this is the only one that describes a genuine limitation of the code as written, which makes it the best answer.
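The limitation can be illustrated with a small sketch. It fits a one-feature least-squares line in plain Python rather than calling scikit-learn, and the descriptor values and the quadratic relationship are hypothetical, constructed so that the target depends on the square of the feature. The helper names fit_line and r2 are also assumptions made for this example.

```python
def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2(x, y, slope, intercept):
    """R² of the fitted line on (x, y)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical descriptor and a purely quadratic response
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [v ** 2 for v in x]

# A straight line in x cannot follow the curve
slope_raw, intercept_raw = fit_line(x, y)
r2_raw = r2(x, y, slope_raw, intercept_raw)

# The same linear fit on the transformed feature x² recovers it
x_sq = [v ** 2 for v in x]
slope_sq, intercept_sq = fit_line(x_sq, y)
r2_sq = r2(x_sq, y, slope_sq, intercept_sq)

print("R² on x: ", r2_raw)
print("R² on x²:", r2_sq)
```

On the raw feature the fitted line is flat and explains none of the variance, while the same linear method fits perfectly once the feature is squared. The model family, not the metric or the split, is what limits the fit in this constructed case.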

Option 5: The code incorrectly selects training and testing data by sorting molecules before splitting, introducing bias.
- The code starts with random.shuffle(active_compounds), which randomizes the order of the list in place. It then computes a split index at 60% of the list length and slices the list into training and test sets. Nothing is sorted at any point, and because the list is shuffled first, slicing at a fixed index yields a random split. The option attributes bias to an operation the code never performs, so it is incorrect.
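The shuffle-then-slice pattern can be sketched on its own as follows. The compounds list is a hypothetical stand-in for active_compounds, and the fixed seed is an assumption added here so the split is reproducible; the original snippet uses the unseeded module-level random.shuffle.

```python
import random

# Hypothetical stand-in for active_compounds
compounds = [{"id": i, "pKi": 5.0 + 0.1 * i} for i in range(10)]

rng = random.Random(42)  # fixed seed: an assumption, not part of the original code
shuffled = compounds[:]  # copy so the original order is left untouched
rng.shuffle(shuffled)    # randomize order before slicing

split_index = int(len(shuffled) * 0.6)  # 60% for training
train_set, test_set = shuffled[:split_index], shuffled[split_index:]

train_ids = {m["id"] for m in train_set}
test_ids = {m["id"] for m in test_set}

print(len(train_set), len(test_set))
print("overlap:", train_ids & test_ids)
```

Every compound lands in exactly one of the two sets, and which ones land in training depends only on the shuffle, not on the original ordering of the list.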

2254 BIOSC 1544 SEC1000 Exam #2
