Question

In Lecture 2, we built a classifier to distinguish human-written passwords (e.g., WinterDragon99!) from random passwords (e.g., 2@*7N!bx?2c). We designed features such as the number of consecutive letters and numbers.

Now you need to work on a modified problem: we removed all numbers and obtained a new dataset: https://github.com/liususan091219/cs541/blob/main/lectures/lecture3/. However, the old feature now yields an error rate of 0.36 on this new dataset. Inspect the new dataset and design features that improve on this error rate.

You should start by reproducing this error rate in the notebook below, then revise featureExtractor to reduce the error rate to below 0.2:

https://colab.research.google.com/drive/16MFcWCs7H44lVSjzAf8y3PhqHvm8xfMB?usp=sharing

Note: You must enter the correct answer before 6:50 to receive the bonus points; no bonus points are awarded for a correct answer after 6:50. You earn 1.5 bonus points if the error rate is < 0.2. Raise your hand if you achieve an error rate < 0.2.

BlackTom AI · Accepted Answer

The prompt describes a machine learning task rather than a multiple-choice question, so we’ll reason through the problem setup and potential feature-design directions step by step.

First, restating the task: an earlier classifier distinguished human-written passwords from random passwords using features such as the number of consecutive letters and numbers. On a modified dataset where all numbers were removed, the old feature set yields an error rate of 0.36. The goal is to inspect the new dataset, reproduce that 0.36 error rate, and then engineer features in a revised featureExtractor that push the error rate below 0.2.

Next, consider why removing all digits from the passwords degrades the original features: features that rely on numeric patterns (such as runs of digits, digit positions, or digit-letter transitions) take the same value for every password once digits are absent, so they no longer help separate the two classes. The model is then left with less discriminative cues, which increases the error.
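To make the point above concrete, here is a minimal sketch using the two example passwords from the question. The functions `longest_digit_run` and `strip_digits` are hypothetical helpers written for illustration; they are not taken from the lecture notebook.

```python
import re


def longest_digit_run(password: str) -> int:
    """Length of the longest run of consecutive digits (illustrative feature)."""
    runs = re.findall(r"\d+", password)
    return max((len(r) for r in runs), default=0)


def strip_digits(password: str) -> str:
    """Mimic the modified dataset by removing every digit."""
    return re.sub(r"\d", "", password)


for pw in ["WinterDragon99!", "2@*7N!bx?2c"]:
    stripped = strip_digits(pw)
    # Before stripping, the feature differs between the two passwords;
    # after stripping, it is 0 for both and carries no signal.
    print(pw, longest_digit_run(pw), "->", stripped, longest_digit_run(stripped))
```

Any feature that is constant across the whole dataset cannot help a classifier, which is consistent with the higher error rate reported on the new data.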

Now, outline a strategy to reproduce the 0.36 error rate:
- Load the new dataset from the provided repository and verify the data format, labels, and train/test split. Ensure you reproduce the same train/test split as in the original notebook if possible, or document any differences.
- Re-implement or run the existing featureExtractor that previously used numeric-related features (e.g., number of consecutive letters and numbers) to see how it behaves on the no-numbers dataset.
- Evaluate the baseline model to confirm the reported error rate of approximately 0.36, noting metrics (accuracy, precision, recall, F1) and any class imbalances that might affect results.
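The evaluation step above can be sketched in plain Python as follows. The label encoding (1 for human-written, -1 for random) and the function name `evaluate` are assumptions for illustration; the notebook may use a different encoding or its own evaluation helper.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return error rate, precision, recall, and F1 for binary labels.

    Assumes both lists have the same non-zero length and that `positive`
    marks the class of interest (here assumed to be human-written).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    errors = sum(1 for t, p in pairs if t != p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "error_rate": errors / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels only; these are not results from the lecture dataset.
metrics = evaluate([1, 1, -1, -1, 1], [1, -1, -1, 1, 1])
print(metrics)
```

Reporting precision and recall alongside the error rate makes it easier to see whether the baseline's mistakes are concentrated in one class.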

With the baseline reproduced, proceed to design features tailored to the no-numbers setting. Consider the following directions, explaining why each could help reduce error:
- Switch to features that capture character-type patterns independent of digits:
  - Proportions of upper-case vs lower-case letters, presence of special characters, and occurrence of common password motifs (e.g., capitalization patterns, repeated substrings).
  - Distributional features such as entropy estimates of character sequences, n-gram character frequencies (e.g., 2- to 4-grams) over the alphabet present in the data.
  - Length-related features (password length, run-length encoding statistics, presence of palindrome-like structures).
- Incorporate context-aware sequence features:
  - Sliding-window character class transitions (e.g., transitions from letters to symbols, or consecutive identical characters) to capture patterns beyond simple counts.
  - Position-wise features such as first/last character class, or whether the string starts/ends with a particular class of characters.
- Regularization- and normalization-friendly features:
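  - Normalize count-based features by password length so that longer passwords do not score higher simply because they are longer; this matters more here because removing digits shortens passwords by different amounts.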
  - Use feature scaling or standardization if the model benefits from it (e.g., SVM with linear kernel, logistic regression).
- Model-appropriate feature engineering strategies:
  - If using logistic regression or linear classifiers, ensure feature selection or L1/L2 regularization helps avoid overfitting to idiosyncratic password artifacts.
  - If using tree-based methods (random forest, gradient boosting), richer discrete features like n-grams and categorical encodings can improve splits, but be mindful of overfitting and computational cost.
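Several of the directions above (character-class proportions, entropy, class transitions, run lengths, position-wise flags, and length normalization) can be combined in one digit-free extractor. The sketch below is illustrative only: the name `extract_features`, the feature names, and the choice to return a dictionary are assumptions, and the notebook's featureExtractor may expect a different interface.

```python
import math
from collections import Counter
from itertools import groupby


def char_class(ch: str) -> str:
    """Coarse character class: U (upper), L (lower), S (symbol or other)."""
    if ch.isupper():
        return "U"
    if ch.islower():
        return "L"
    return "S"


def shannon_entropy(s: str) -> float:
    """Empirical per-character Shannon entropy in bits."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def extract_features(password: str) -> dict:
    """Digit-independent features, normalized by length where it makes sense."""
    n = len(password)
    if n == 0:
        return {}
    classes = [char_class(ch) for ch in password]
    # Collapse consecutive identical classes; each boundary is one transition.
    class_runs = [k for k, _ in groupby(classes)]
    # Lengths of maximal runs of letters, ignoring case.
    letter_runs = [len(list(g))
                   for is_alpha, g in groupby(password, key=str.isalpha)
                   if is_alpha]
    return {
        "length": n,
        "upper_ratio": classes.count("U") / n,
        "lower_ratio": classes.count("L") / n,
        "symbol_ratio": classes.count("S") / n,
        "entropy": shannon_entropy(password),
        "class_transitions_per_char": (len(class_runs) - 1) / n,
        "longest_letter_run_ratio": max(letter_runs, default=0) / n,
        "starts_upper": int(classes[0] == "U"),
        "ends_symbol": int(classes[-1] == "S"),
    }


print(extract_features("WinterDragon!"))
print(extract_features("@*N!bx?c"))
```

Whether any of these features actually brings the error rate below 0.2 has to be checked on the dataset; the sketch only shows how such features could be computed.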

Practical steps to implement the feature changes:
- Start by adding features based on character-type distributions and n-gram statistics that do not rely on numeric characters. Compute simple descriptive features (counts, proportions, entropy) for each password.
- Experiment with character-level n-gram features (2- to 4-grams) extracted from the string, then apply a linear or tree-based model with appropriate regularization.
- Evaluate incremental improvements by ablation: remove or add each feature group to see its impact on error rate, ensuring you can trace which features contribute most.
- Validate the final featureExtractor by re-running the evaluation on the same train/test split and aiming for an error rate below 0.2.
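The n-gram and ablation steps above can be sketched with the standard library alone. The function names `char_ngrams` and `without_group` and the `"{n}gram:"` key prefix are hypothetical conventions chosen for this example; they are not part of the notebook.

```python
from collections import Counter


def char_ngrams(password: str, n_min: int = 2, n_max: int = 4) -> Counter:
    """Count case-sensitive character n-grams for n in [n_min, n_max].

    Keys are prefixed with the n-gram order (e.g. "2gram:ab") so that a
    whole group can be removed later for an ablation run.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(password) - n + 1):
            counts[f"{n}gram:{password[i:i + n]}"] += 1
    return counts


def without_group(features: dict, prefix: str) -> dict:
    """Drop every feature whose name starts with `prefix` (for ablation)."""
    return {k: v for k, v in features.items() if not k.startswith(prefix)}


feats = char_ngrams("WinterDragon!")
no_bigrams = without_group(feats, "2gram:")
print(len(feats), len(no_bigrams))
```

Re-running the same evaluation with one group removed at a time shows which n-gram orders contribute to the error rate and which only add dimensionality.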

Potential pitfalls and diagnostics:
- If class imbalance exists, accuracy alone may be misleading; examine precision, recall, and F1 to understand the trade-offs.
- Beware leakage across the train/test split: ensure that nothing derived from the test set (for example, n-gram vocabularies or scaling statistics) is used during feature engineering.
- Because removing digits shifts the data distribution, some previously informative cues may become uninformative, so you may need to redesign features from first principles rather than rely on prior heuristics.
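One way to guard against the leakage pitfall above is to build any data-dependent vocabulary from the training passwords only and then reuse it unchanged on the test passwords. The following is a minimal sketch under that assumption; `fit_vocabulary`, `vectorize`, and the `min_count` threshold are illustrative names and choices, not part of the notebook.

```python
from collections import Counter


def bigrams(s: str) -> list:
    """All overlapping character bigrams of s."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def fit_vocabulary(train_passwords, min_count: int = 2) -> dict:
    """Map each bigram seen at least `min_count` times in TRAINING data to an index."""
    counts = Counter(g for pw in train_passwords for g in bigrams(pw))
    kept = sorted(g for g, c in counts.items() if c >= min_count)
    return {g: i for i, g in enumerate(kept)}


def vectorize(password: str, vocab: dict) -> list:
    """Bigram count vector; bigrams outside the training vocabulary are ignored."""
    vec = [0] * len(vocab)
    for g in bigrams(password):
        idx = vocab.get(g)
        if idx is not None:
            vec[idx] += 1
    return vec


# Toy strings only; the vocabulary is fitted on the training side alone.
train = ["abab", "abc"]
test = ["xabab", "zz"]
vocab = fit_vocabulary(train)
print(vocab, [vectorize(pw, vocab) for pw in test])
```

The same pattern applies to scaling: compute means and standard deviations on the training split and apply them as fixed constants to the test split.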

Final note: since the task is not multiple-choice and there is no explicit correct option to select, the key objective is to methodically reproduce the baseline error and then iteratively improve the featureExtractor with features suited to the digit-free data, documenting each feature’s contribution to reducing the error below 0.2.

Artificial Intelligence Lecture 3 quiz
