Which of the following code blocks returns a copy of DataFrame transactionsDf where the column storeId has been converted to string type?
A. transactionsDf.withColumn("storeId", convert("storeId", "string"))
B. transactionsDf.withColumn("storeId", col("storeId", "string"))
C. transactionsDf.withColumn("storeId", col("storeId").convert("string"))
D. transactionsDf.withColumn("storeId", col("storeId").cast("string"))
E. transactionsDf.withColumn("storeId", convert("storeId").as("string"))
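For reference, option D's cast() call is the standard way to change a column's type, and withColumn() returns a copy of the DataFrame. A minimal sketch, assuming a tiny illustrative stand-in for transactionsDf:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for transactionsDf, with storeId as an integer
transactionsDf = spark.createDataFrame([(1, 25), (2, 2)], ["transactionId", "storeId"])

# Column.cast() converts the column to string type
converted = transactionsDf.withColumn("storeId", col("storeId").cast("string"))
converted.printSchema()  # storeId is now of type string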
Which of the following code blocks returns a new DataFrame with the same columns as DataFrame transactionsDf, except for columns predError and value which should be removed?
A. transactionsDf.drop(["predError", "value"])
B. transactionsDf.drop("predError", "value")
C. transactionsDf.drop(col("predError"), col("value"))
D. transactionsDf.drop(predError, value)
E. transactionsDf.drop("predError and value")
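Option B matches drop()'s variadic signature, which takes column names as separate string arguments rather than a list. A minimal sketch, assuming an illustrative stand-in DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for transactionsDf
transactionsDf = spark.createDataFrame(
    [(1, 3, 4), (2, 6, 7)], ["transactionId", "predError", "value"]
)

# drop() takes each column name as its own string argument, not a list
trimmed = transactionsDf.drop("predError", "value")
print(trimmed.columns)  # ['transactionId']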
Which of the following code blocks creates a new one-column, two-row DataFrame dfDates with column date of type timestamp?
A. dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])
   dfDates = dfDates.withColumn("date", to_timestamp("dd/MM/yyyy HH:mm:ss", "date"))
B. dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
   dfDates = dfDates.withColumnRenamed("date", to_timestamp("date", "yyyy-MM-ddHH:mm:ss"))
C. dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
   dfDates = dfDates.withColumn("date", to_timestamp("date", "dd/MM/yyyy HH:mm:ss"))
D. dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])
   dfDates = dfDates.withColumnRenamed("date", to_datetime("date", "yyyy-MM-ddHH:mm:ss"))
E. dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
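Option C illustrates the working pattern: each row must be a one-element tuple so Spark infers a single string column, and to_timestamp() takes the column first with a format that matches the data. A minimal runnable sketch:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

spark = SparkSession.builder.getOrCreate()

# Rows as one-element tuples yield a single string column named date
dfDates = spark.createDataFrame(
    [("23/01/2022 11:28:12",), ("24/01/2022 10:58:34",)], ["date"]
)

# to_timestamp(column, format): the format must match the dd/MM/yyyy HH:mm:ss layout of the data
dfDates = dfDates.withColumn("date", to_timestamp("date", "dd/MM/yyyy HH:mm:ss"))
dfDates.printSchema()  # date is now of type timestamp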
Which of the following code blocks generally causes a great amount of network traffic?
A. DataFrame.select()
B. DataFrame.coalesce()
C. DataFrame.collect()
D. DataFrame.rdd.map()
E. DataFrame.count()
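collect() is the network-heavy operation here, because it transfers every row from the executors to the driver. A minimal sketch contrasting it with count():

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100000)

# collect() ships all partitions across the network to the driver
rows = df.collect()
print(len(rows))  # 100000

# count() returns only a single number to the driver, so far less data moves
print(df.count())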
The code block displayed below contains multiple errors. The code block should remove column transactionDate from DataFrame transactionsDf and add a column transactionTimestamp in which
dates that are expressed as strings in column transactionDate of DataFrame transactionsDf are converted into unix timestamps. Find the errors.
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+----------------+
|transactionId|predError|value|storeId|productId|   f| transactionDate|
+-------------+---------+-----+-------+---------+----+----------------+
|            1|        3|    4|     25|        1|null|2020-04-26 15:35|
|            2|        6|    7|      2|        2|null|2020-04-13 22:01|
|            3|        3| null|     25|        3|null|2020-04-02 10:53|
+-------------+---------+-----+-------+---------+----+----------------+
Code block:
transactionsDf = transactionsDf.drop("transactionDate")
transactionsDf["transactionTimestamp"] = unix_timestamp("transactionDate", "yyyy-MM-dd")
A. Column transactionDate should be dropped after transactionTimestamp has been written. The string indicating the date format should be adjusted. The withColumn operator should be used instead of the existing column assignment. Operator to_unixtime() should be used instead of unix_timestamp().
B. Column transactionDate should be dropped after transactionTimestamp has been written. The withColumn operator should be used instead of the existing column assignment. Column transactionDate should be wrapped in a col() operator.
C. Column transactionDate should be wrapped in a col() operator.
D. The string indicating the date format should be adjusted. The withColumnReplaced operator should be used instead of the drop and assign pattern in the code block to replace column transactionDate with the new column transactionTimestamp.
E. Column transactionDate should be dropped after transactionTimestamp has been written. The string indicating the date format should be adjusted. The withColumn operator should be used instead of the existing column assignment.
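Applying the fixes listed in option E (write transactionTimestamp before dropping transactionDate, use withColumn instead of column assignment, and match the format string to the data) yields something like this sketch, assuming a hypothetical two-row stand-in shaped like the sample above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import unix_timestamp

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for transactionsDf
transactionsDf = spark.createDataFrame(
    [(1, "2020-04-26 15:35"), (2, "2020-04-13 22:01")],
    ["transactionId", "transactionDate"],
)

# Add the unix timestamp first, with a format matching the data,
# then drop the original string column
transactionsDf = (
    transactionsDf
    .withColumn("transactionTimestamp", unix_timestamp("transactionDate", "yyyy-MM-dd HH:mm"))
    .drop("transactionDate")
)
transactionsDf.show()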
Which of the following code blocks returns an exact copy of DataFrame transactionsDf that does not include rows in which column storeId has the value 25?
A. transactionsDf.remove(transactionsDf.storeId==25)
B. transactionsDf.where(transactionsDf.storeId!=25)
C. transactionsDf.filter(transactionsDf.storeId==25)
D. transactionsDf.drop(transactionsDf.storeId==25)
E. transactionsDf.select(transactionsDf.storeId!=25)
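Option B keeps every row whose storeId is not 25; where() is an alias of filter(). A minimal sketch, assuming an illustrative stand-in DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for transactionsDf
transactionsDf = spark.createDataFrame(
    [(1, 25), (2, 2), (3, 25)], ["transactionId", "storeId"]
)

# where() (an alias of filter()) keeps only rows matching the condition
filtered = transactionsDf.where(transactionsDf.storeId != 25)
filtered.show()  # only transactionId 2 remains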
The code block displayed below contains an error. The code block should display the schema of DataFrame transactionsDf. Find the error.
Code block:
transactionsDf.rdd.printSchema
A. There is no way to print a schema directly in Spark; since the schema can be printed easily by using print(transactionsDf.columns), that should be used instead.
B. The code block should be wrapped into a print() operation.
C. printSchema is only accessible through the Spark session, so the code block should be rewritten as spark.printSchema(transactionsDf).
D. printSchema is a method and should be written as printSchema(). It is also not callable through transactionsDf.rdd, but should be called directly from transactionsDf.
E. printSchema is not a method of transactionsDf.rdd. Instead, the schema should be printed via transactionsDf.print_schema().
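As option D describes, printSchema() is a DataFrame method that must be called with parentheses; the RDD API has no such method. A minimal sketch with a hypothetical stand-in DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame([(1, 25)], ["transactionId", "storeId"])

# Called directly on the DataFrame, with parentheses, not via .rdd
transactionsDf.printSchema()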
In which order should the code blocks shown below be run to return the number of records that are not empty in column value in the DataFrame resulting from an inner join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively?
1. .filter(~isnull(col('value')))
2. .count()
3. transactionsDf.join(itemsDf, col("transactionsDf.productId")==col("itemsDf.itemId"))
4. transactionsDf.join(itemsDf, transactionsDf.productId==itemsDf.itemId, how='inner')
5. .filter(col('value').isnotnull())
6. .sum(col('value'))
A. 4, 1, 2
B. 3, 1, 6
C. 3, 1, 2
D. 3, 5, 2
E. 4, 6
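A runnable sketch of the 4, 1, 2 chain (option A), assuming tiny hypothetical stand-ins for both DataFrames; note that isnotnull() in block 5 is not a valid Column method (the API spells it isNotNull()):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, isnull

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for transactionsDf and itemsDf
transactionsDf = spark.createDataFrame([(1, 4), (2, None)], ["productId", "value"])
itemsDf = spark.createDataFrame([(1,), (2,)], ["itemId"])

# Blocks 4, 1, 2: explicit inner join, filter out null values, count the rest
n = (
    transactionsDf
    .join(itemsDf, transactionsDf.productId == itemsDf.itemId, how="inner")
    .filter(~isnull(col("value")))
    .count()
)
print(n)  # 1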
Which of the following code blocks performs an inner join between DataFrame itemsDf and DataFrame transactionsDf, using columns itemId and transactionId as join keys, respectively?
A. itemsDf.join(transactionsDf, "inner", itemsDf.itemId == transactionsDf.transactionId)
B. itemsDf.join(transactionsDf, itemId == transactionId)
C. itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.transactionId, "inner")
D. itemsDf.join(transactionsDf, "itemsDf.itemId == transactionsDf.transactionId", "inner")
E. itemsDf.join(transactionsDf, col(itemsDf.itemId) == col(transactionsDf.transactionId))
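Option C follows join()'s signature of (other, on, how), with the join expression before the join type. A minimal sketch with hypothetical stand-in DataFrames:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for itemsDf and transactionsDf
itemsDf = spark.createDataFrame([(1, "a")], ["itemId", "itemName"])
transactionsDf = spark.createDataFrame([(1, 100)], ["transactionId", "value"])

# join(other, on, how): the join condition comes second, the join type third
joined = itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.transactionId, "inner")
joined.show()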
The code block displayed below contains an error. The code block should return DataFrame transactionsDf, but with the column storeId renamed to storeNumber. Find the error.
Code block:
transactionsDf.withColumn("storeNumber", "storeId")
A. Instead of withColumn, the withColumnRenamed method should be used.
B. Arguments "storeNumber" and "storeId" each need to be wrapped in a col() operator.
C. Argument "storeId" should be the first and argument "storeNumber" should be the second argument to the withColumn method.
D. The withColumn operator should be replaced with the copyDataFrame operator.
E. Instead of withColumn, the withColumnRenamed method should be used and argument "storeId" should be the first and argument "storeNumber" should be the second argument to that method.
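Option E describes withColumnRenamed(existingName, newName), where the existing column name comes first. A minimal sketch with a hypothetical stand-in DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame([(1, 25)], ["transactionId", "storeId"])

# withColumnRenamed(existingName, newName)
renamed = transactionsDf.withColumnRenamed("storeId", "storeNumber")
print(renamed.columns)  # ['transactionId', 'storeNumber']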