Consider the following Python code, which implements both regression and classification models to predict molecular activity:

import random
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier

random.shuffle(compound_data)
split_idx = int(len(compound_data) * 0.6)
train_set, test_set = compound_data[:split_idx], compound_data[split_idx:]

features = ["logP", "num_hbd", "num_hba", "mw", "num_rotatable_bonds"]
target_reg = "pKi"
target_cls = "status" # "Active" or "Inactive"

X_train_reg = [[mol[feat] for feat in features] for mol in train_set]
y_train_reg = [mol[target_reg] for mol in train_set]
X_test_reg = [[mol[feat] for feat in features] for mol in test_set]
y_test_reg = [mol[target_reg] for mol in test_set]

X_train_cls = X_train_reg # Using same features as regression
y_train_cls = [1 if mol[target_cls] == "Active" else 0 for mol in train_set]
X_test_cls = X_test_reg
y_test_cls = [1 if mol[target_cls] == "Active" else 0 for mol in test_set]

reg_model = LinearRegression()
reg_model.fit(X_train_reg, y_train_reg)
print("R² on test set:", reg_model.score(X_test_reg, y_test_reg))

cls_model = RandomForestClassifier()
cls_model.fit(X_train_cls, y_train_cls)
print("Accuracy on test set:", cls_model.score(X_test_cls, y_test_cls))

Based on this code, which of the following statements is most accurate?

BlackTom AI · Accepted Answer

To ground the discussion, first restate the setup: the code trains two models on the same molecular features but for different targets. A regression model (LinearRegression) predicts a continuous activity value (pKi), and a classification model (RandomForestClassifier) predicts a binary status (Active/Inactive). The question asks which statement best describes when each type of model is appropriate.

Option 1: "A model like cls_model is preferred if the data set contains only active compounds, but no inactives." This is flawed because a classifier needs examples of both classes to learn a decision boundary. With no inactives, nothing in the training data distinguishes Active from Inactive, so the model can only ever predict the one class it has seen. Its accuracy on such data would also say nothing about how it handles a real inactive compound.
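The point about Option 1 can be checked before any model is trained. The following is a minimal sketch, assuming labels encoded as 1 = Active and 0 = Inactive as in y_train_cls; the helper names and the example label lists are hypothetical and not part of the original code.

```python
from collections import Counter


def class_counts(labels):
    """Count how many examples fall into each class label."""
    return Counter(labels)


def has_both_classes(labels):
    """Return True only if at least two distinct classes are present."""
    return len(set(labels)) >= 2


# Hypothetical label lists: 1 = Active, 0 = Inactive.
only_actives = [1, 1, 1, 1, 1]
mixed = [1, 0, 1, 1, 0]

print(class_counts(only_actives), has_both_classes(only_actives))
print(class_counts(mixed), has_both_classes(mixed))

# A rule that always answers "Active" is right every time on actives-only
# data, so a high accuracy there carries no information about inactives.
always_active_accuracy = sum(1 for y in only_actives if y == 1) / len(only_actives)
print(always_active_accuracy)
```

A check like has_both_classes(y_train_cls) before calling fit would flag the situation Option 1 describes.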

Option 2: "A model like reg_model is better than one like cls_model because it provides more detailed predictions rather than just a binary label." This statement captures a real distinction: regression outputs a continuous value (pKi) that conveys the strength of activity, whereas classification yields a discrete label. The claim is too broad, however. The extra granularity helps only when the problem calls for a continuous estimate and the data are good enough to support one. If the goal is a yes/no decision, or if the measured pKi values are too noisy for precise numeric prediction, a classifier can be the better choice.
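The difference in granularity discussed for Option 2 can be illustrated by collapsing continuous values into labels. This is only a sketch: the cutoff of 6.0 and the example pKi values are assumptions made for illustration, and the source does not say how the "status" field was actually assigned.

```python
# Hypothetical cutoff: compounds with pKi >= 6.0 are called "Active".
# The original data set may define "status" differently.
PKI_ACTIVE_CUTOFF = 6.0


def pki_to_status(pki, cutoff=PKI_ACTIVE_CUTOFF):
    """Collapse a continuous pKi value into a binary status label."""
    return "Active" if pki >= cutoff else "Inactive"


# Made-up pKi values standing in for regression predictions.
predicted_pki = [4.2, 5.9, 6.1, 8.7]
statuses = [pki_to_status(value) for value in predicted_pki]

for value, status in zip(predicted_pki, statuses):
    print(value, status)
```

Here 5.9 and 6.1 receive different labels although they are nearly equal, while 6.1 and 8.7 share a label although they differ widely. The continuous value can always be reduced to a label, but the label cannot be turned back into a pKi.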

Option 3: "A model like reg_model is preferred when predicting continuous activity values (e.g., pKi), while a model like cls_model is useful when distinguishing between active and inactive compounds." This option aligns with standard machine learning practice: regression is chosen for continuous targets, and classification is chosen for categorical targets. The code follows this pairing by using LinearRegression for pKi and RandomForestClassifier for the Active/Inactive status. The code also shows that the same feature set can serve both target types without any change to the feature engineering.

Option 4: "cls_model is better than reg_model because accuracy is easier to interpret than R²." This statement appeals to interpretability, but it makes a broad comparative claim that ignores context. Accuracy is the fraction of compounds labeled correctly, while R² is the fraction of the variance in pKi that the predictions explain. The two metrics measure different things for different tasks, so their values are not directly comparable, and how easy either one is to interpret depends on the user’s needs. Choosing a model type because its metric looks simpler ignores what each problem actually needs to evaluate.
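To see why the two scores are not comparable, it helps to compute them by hand. In scikit-learn, score returns mean accuracy for a classifier and the coefficient of determination R² for a regressor. The sketch below reimplements both in plain Python on made-up values; the function names and the data are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted exactly right."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up labels: three of four predictions are correct.
labels_true = [1, 0, 1, 1]
labels_pred = [1, 0, 0, 1]
acc = accuracy(labels_true, labels_pred)

# Made-up pKi values with small prediction errors.
pki_true = [5.0, 6.0, 7.0, 8.0]
pki_pred = [5.5, 6.0, 6.5, 8.0]
r2 = r_squared(pki_true, pki_pred)

# Predicting the mean pKi for every compound gives R² = 0 by construction.
mean_pki = sum(pki_true) / len(pki_true)
r2_baseline = r_squared(pki_true, [mean_pki] * len(pki_true))

print("accuracy:", acc)
print("R²:", r2)
print("R² of mean baseline:", r2_baseline)
```

Accuracy lies between 0 and 1, whereas R² is 0 for a model that always predicts the mean and can be negative for a model that does worse than that. A value such as 0.75 therefore means something different under each metric.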

Putting these evaluations together, the choice between regression and classification depends on the nature of the target variable and the decision being made: continuous prediction versus discrete labeling. Option 3 states this mapping most clearly and matches the code, which pairs LinearRegression with pKi and RandomForestClassifier with the Active/Inactive status. The other options fail for the reasons given above: Option 1 ignores the need for both classes, Option 2 overgeneralizes the value of granular predictions, and Option 4 compares metrics that measure different things.

2254 BIOSC 1544 SEC1000 Exam #2
