A machine learning engineer has identified the best run from an MLflow Experiment. They have stored the run ID in the run_id variable and identified the logged model name as "model". They now want to register that model in the MLflow Model Registry with the name "best_model".
Which lines of code can they use to register the model associated with run_id to the MLflow Model Registry?
A. mlflow.register_model(run_id, "best_model")
B. mlflow.register_model(f"runs:/{run_id}/model", "best_model")
C. millow.register_model(f"runs:/{run_id)/model")
D. mlflow.register_model(f"runs:/{run_id}/best_model", "model")
Which of the following is a benefit of using vectorized pandas UDFs instead of standard PySpark UDFs?
A. The vectorized pandas UDFs allow for the use of type hints
B. The vectorized pandas UDFs process data in batches rather than one row at a time
C. The vectorized pandas UDFs allow for pandas API use inside of the function
D. The vectorized pandas UDFs work on distributed DataFrames
E. The vectorized pandas UDFs process data in memory rather than spilling to disk
A data scientist learned during their training to always use 5-fold cross-validation in their model development workflow. A colleague suggests that there are cases where a train-validation split could be preferred over k-fold cross-validation
when k > 2.
Which of the following describes a potential benefit of using a train-validation split over k-fold cross-validation in this scenario?
A. A holdout set is not necessary when using a train-validation split
B. Reproducibility is achievable when using a train-validation split
C. Fewer hyperparameter values need to be tested when usinga train-validation split
D. Bias is avoidable when using a train-validation split
E. Fewer models need to be trained when using a train-validation split
A machine learning engineer is trying to scale a machine learning pipelinepipelinethat contains multiple feature engineering stages and a modeling stage. As part of the cross-validation process, they are using the following code block:
A colleague suggests that the code block can be changed to speed up the tuning process by passing the model object to theestimatorparameter and then placing the updated cv object as the final stage of thepipelinein place of the original model.
Which of the following is a negative consequence of the approach suggested by the colleague?
A. The model will take longerto train for each unique combination of hvperparameter values
B. The feature engineering stages will be computed using validation data
C. The cross-validation process will no longer be
D. The cross-validation process will no longer be reproducible
E. The model will be refit one more per cross-validation fold
A data scientist has produced two models for a single machine learning problem. One of the models performs well when one of the features has a value of less than 5, and the other model performs well when the value of that feature is greater than or equal to 5. The data scientist decides to combine the two models into a single machine learning solution.
Which of the following terms is used to describe this combination of models?
A. Bootstrap aggregation
B. Support vector machines
C. Bucketing
D. Ensemble learning
E. Stacking
A health organization is developing a classification model to determine whether or not a patient currently has a specific type of infection. The organization's leaders want to maximize the number of positive cases identified by the model.
Which of the following classification metrics should be used to evaluate the model?
A. RMSE
B. Precision
C. Area under the residual operating curve
D. Accuracy
E. Recall
An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.
Which of the following explanations justifies this suggestion?
A. One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.
B. One-hot encoding is dependent on the target variable's values which differ for each apaplication.
C. One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.
D. One-hot encoding is not a common strategy for representing categorical feature variables numerically.
A data scientist has defined a Pandas UDF function predict to parallelize the inference process for a single-node model:
They have written the following incomplete code block to use predict to score each record of Spark DataFramespark_df:
Which of the following lines of code can be used to complete the code block to successfully complete the task?
A. predict(*spark_df.columns)
B. mapInPandas(predict)
C. predict(Iterator(spark_df))
D. mapInPandas(predict(spark_df.columns))
E. predict(spark_df.columns)
A data scientist has developed a linear regression model using Spark ML and computed the predictions in a Spark DataFrame preds_df with the following schema:
prediction DOUBLE actual DOUBLE
Which of the following code blocks can be used to compute the root mean-squared-error of the model according to the data in preds_df and assign it to the rmse variable?
A. Option A
B. Option B
C. Option C
D. Option D
An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.
Which of the following explanations justifies this suggestion?
A. One-hot encoding is not supported by most machine learning libraries.
B. One-hot encoding is dependent on the target variable's values which differ for each application.
C. One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.
D. One-hot encoding is not a common strategy for representing categorical feature variables numerically.
E. One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.