What is LOESS used for?
A. It fits a smoothed curve to scatterplot data, to give a general sense of the data's behavior.
B. It is a significance test for the correlation between two variables.
C. It plots a continuous variable versus a discrete variable, to compare distributions across classes.
D. It is run after a one-way ANOVA, to determine which population has the highest mean value.
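For readers unfamiliar with the technique named in this question, the following is a minimal sketch of LOESS-style smoothing: a weighted straight-line fit at each point, using only the nearest neighbours and tricube weights. The function name `loess_point`, the `frac` parameter, and the sample data are illustrative assumptions, not part of the question or of any particular library.

```python
def loess_point(xs, ys, x0, frac=0.5):
    """Estimate y at x0 with a locally weighted linear fit (tricube weights)."""
    n = len(xs)
    k = min(n, max(2, int(round(frac * n))))  # size of the local window
    # Keep the k observations closest to x0.
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    d_max = abs(nearest[-1][0] - x0) or 1.0   # window half-width; avoid dividing by zero

    sw = swx = swy = swxx = swxy = 0.0
    for x, y in nearest:
        u = abs(x - x0) / d_max
        w = (1.0 - u ** 3) ** 3               # tricube weight: 1 at x0, 0 at the window edge
        sw += w
        swx += w * x
        swy += w * y
        swxx += w * x * x
        swxy += w * x * y

    if sw == 0.0:
        # Every neighbour sits on the window edge; fall back to a plain mean.
        return sum(y for _, y in nearest) / k
    denom = sw * swxx - swx ** 2
    if abs(denom) < 1e-12:
        # All weighted x values coincide; fall back to the weighted mean.
        return swy / sw
    slope = (sw * swxy - swx * swy) / denom
    intercept = (swy - slope * swx) / sw
    return intercept + slope * x0


# Hypothetical scatterplot data with some noise around an upward trend.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.2, 2.7, 5.4, 6.8, 9.3, 10.9, 13.1, 15.2, 16.8, 19.1]
smoothed = [loess_point(xs, ys, x) for x in xs]
```

Plotting `smoothed` against `xs` on top of the raw points gives the smoothed curve; a larger `frac` widens the window and produces a flatter, less wiggly line.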
What is an appropriate data visualization to use in a presentation to a project sponsor?
A. Bar Chart
B. Pie Chart
C. Box and Whisker Plot
D. Density Plot
Which data asset is an example of semi-structured data?
A. XML data file
B. Database table
C. Webserver log
D. News article
Assume you are performing an analysis to detect fraud in credit card usage. You need to ensure that higher-risk transactions, which may indicate fraudulent credit card activity, are retained in your data for analysis and not dropped as outliers during pre-processing.
What is the approach for loading data into the analytical sandbox for this analysis?
A. ELT
B. ETL
C. EDW
D. OLTP
The web analytics team uses Hadoop to process access logs. They now want to correlate this data with structured user data residing in their massively parallel database. Which tool should they use to export the structured data from Hadoop?
A. Sqoop
B. Pig
C. Chukwa
D. Scribe
Based on the exhibit, the table shows the values for the input Boolean attributes A, B, and C. In addition, the exhibit shows the values for the output attribute "class".
Which decision tree is valid for the data?
A. Tree A
B. Tree B
C. Tree C
D. Tree D
You have been assigned to do a study of the daily revenue effect of a pricing model of online transactions. All the data currently available to you has been loaded into your analytics database; revenue data, pricing data, and online transaction data. You find that all the data comes in different levels of granularity. The transaction data has timestamps (day, hour, minutes, seconds), pricing is stored at the daily level, and revenue data is only reported monthly. What is your next step?
A. Report back to the business owner that the current data model does not support the business question.
B. Interpolate a daily model for revenue from the monthly revenue data.
C. Aggregate all data to the monthly level in order to create a monthly revenue model.
D. Disregard revenue as a driver in the pricing model, and create a daily model based on pricing and transactions only.
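The granularity mismatch described in this scenario can be made concrete with a small sketch that rolls second-level transaction timestamps up to coarser levels. The sample rows, the `roll_up` helper, and the timestamp format are hypothetical and only illustrate the mechanics of aggregation.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical transaction rows: (timestamp, amount).
transactions = [
    ("2024-01-03 09:15:02", 20.0),
    ("2024-01-03 17:40:11", 5.0),
    ("2024-01-19 12:00:00", 12.5),
    ("2024-02-01 08:30:45", 7.5),
]


def roll_up(rows, key_format):
    """Sum amounts per period, where key_format is a strftime pattern for the period."""
    totals = defaultdict(float)
    for timestamp, amount in rows:
        parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        totals[parsed.strftime(key_format)] += amount
    return dict(totals)


daily = roll_up(transactions, "%Y-%m-%d")   # matches the daily pricing data
monthly = roll_up(transactions, "%Y-%m")    # matches the monthly revenue data
```

Fine-grained data can always be summed to a coarser level this way, whereas going in the other direction (monthly to daily) requires assumptions about how the total is distributed within the month.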
When would you prefer a Naive Bayes model to a logistic regression model for classification?
A. When you are using several categorical input variables with over 1000 possible values each.
B. When you need to estimate the probability of an outcome, not just which class it is in.
C. When all the input variables are numerical.
D. When some of the input variables might be correlated.
A disk drive manufacturer has a defect rate of less than 1.5% with 98% confidence. A quality assurance team samples 1000 disk drives and finds 14 defective units. Which action should the team recommend?
A. The manufacturing process is functioning properly and no further action is required
B. A larger sample size should be taken to determine if the plant is operating correctly
C. A smaller sample size should be taken to determine if the plant is operating correctly
D. There is a flaw in the quality assurance process and the sample should be repeated
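The arithmetic behind this scenario can be sketched as follows: compute the sample defect rate and a one-sided upper confidence bound on the true rate. This assumes the normal approximation to the binomial, which is a simplification; the function name `upper_bound` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def upper_bound(defects, sample_size, confidence):
    """One-sided upper confidence bound on a defect rate (normal approximation)."""
    p_hat = defects / sample_size                    # observed defect rate
    z = NormalDist().inv_cdf(confidence)             # one-sided critical value
    std_err = sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat + z * std_err


sample_rate = 14 / 1000
bound = upper_bound(14, 1000, 0.98)
print(f"sample defect rate: {sample_rate:.2%}")
print(f"98% upper bound:    {bound:.2%}")
```

Comparing the bound with the stated 1.5% threshold shows whether this sample alone is enough to support it; because the standard error shrinks with the square root of the sample size, a larger sample tightens the bound.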
Which word or phrase completes the statement? A data warehouse is to a centralized database for reporting as an analytic sandbox is to a _______?
A. Collection of data assets for modeling
B. Collection of low-volume databases
C. Centralized database of KPIs
D. Collection of data assets for ETL