bash-4.2# rm -rf prd_data_to_validate.csv
bash-4.2# python3 num_conversion.py --secret=customer-dp-prd-mongodb --srv_connect=yes --exclude_prev_documents=no
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/lib/python3.7/site-packages/pyspark/jars/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
:: loading settings :: url = jar:file:/usr/local/lib/python3.7/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.mongodb.spark#mongo-spark-connector added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-44cd953a-4d02-4263-9727-cb79c4359449;1.0
    confs: [default]
    found org.mongodb.spark#mongo-spark-connector;10.0.0 in central
    found org.mongodb#mongodb-driver-sync;4.5.1 in central
    [4.5.1] org.mongodb#mongodb-driver-sync;[4.5.0,4.5.99)
    found org.mongodb#bson;4.5.1 in central
    found org.mongodb#mongodb-driver-core;4.5.1 in central
:: resolution report :: resolve 1919ms :: artifacts dl 7ms
    :: modules in use:
    org.mongodb#bson;4.5.1 from central in [default]
    org.mongodb#mongodb-driver-core;4.5.1 from central in [default]
    org.mongodb#mongodb-driver-sync;4.5.1 from central in [default]
    org.mongodb.spark#mongo-spark-connector;10.0.0 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   4   |   1   |   0   |   0   ||   4   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-44cd953a-4d02-4263-9727-cb79c4359449
    confs: [default]
    0 artifacts copied, 4 already retrieved (0kB/8ms)
22/04/13 17:56:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/04/13 17:56:56 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 264.0 B, free 434.4 MiB)
22/04/13 17:56:56 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 466.0 B, free 434.4 MiB)
22/04/13 17:56:56 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 5bcbeccf03ef:42707 (size: 466.0 B, free: 434.4 MiB)
22/04/13 17:56:56 INFO SparkContext: Created broadcast 0 from broadcast at MongoSpark.scala:530
22/04/13 17:56:56 INFO cluster: Cluster created with settings {hosts=[127.0.0.1:27017], srvHost=mongodb-prd.dp.aws.customer.internal, mode=MULTIPLE, requiredClusterType=REPLICA_SET, serverSelectionTimeout='30000 ms', requiredReplicaSetName='prd'}
22/04/13 17:56:56 INFO MongoClientCache: Creating MongoClient: []
22/04/13 17:56:56 INFO cluster: Cluster description not yet available. Waiting for 30000 ms before timing out
22/04/13 17:56:56 INFO cluster: Adding discovered server rs1-prd.dp.aws.customer.internal:27017 to client view of cluster
22/04/13 17:56:56 INFO cluster: Adding discovered server rs2-prd.dp.aws.customer.internal:27017 to client view of cluster
22/04/13 17:56:56 INFO cluster: No server chosen by com.mongodb.client.internal.MongoClientDelegate$1@1b198b72 from cluster description ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=rs1-prd.dp.aws.customer.internal:27017, type=UNKNOWN, state=CONNECTING}, ServerDescription{address=rs2-prd.dp.aws.customer.internal:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
22/04/13 17:56:56 INFO connection: Opened connection [connectionId{localValue:2, serverValue:196735}] to rs2-prd.dp.aws.customer.internal:27017
22/04/13 17:56:56 INFO connection: Opened connection [connectionId{localValue:1, serverValue:199557}] to rs1-prd.dp.aws.customer.internal:27017
22/04/13 17:56:56 INFO cluster: Monitor thread successfully connected to server with description ServerDescription{address=rs1-prd.dp.aws.customer.internal:27017, type=REPLICA_SET_PRIMARY, state=CONNECTED, ok=true, minWireVersion=0, maxWireVersion=13, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=4636901, setName='prd', canonicalAddress=ip-172-19-33-146.eu-central-1.compute.internal:27017, hosts=[ip-172-19-33-146.eu-central-1.compute.internal:27017, ip-172-19-46-161.eu-central-1.compute.internal:27017], passives=[], arbiters=[], primary='ip-172-19-33-146.eu-central-1.compute.internal:27017', tagSet=TagSet{[]}, electionId=7fffffff000000000000000f, setVersion=9, lastWriteDate=Wed Apr 13 17:56:53 UTC 2022, lastUpdateTimeNanos=2870318642892402}
22/04/13 17:56:56 INFO cluster: Monitor thread successfully connected to server with description ServerDescription{address=rs2-prd.dp.aws.customer.internal:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, minWireVersion=0, maxWireVersion=13, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=6158825, setName='prd', canonicalAddress=ip-172-19-46-161.eu-central-1.compute.internal:27017, hosts=[ip-172-19-33-146.eu-central-1.compute.internal:27017, ip-172-19-46-161.eu-central-1.compute.internal:27017], passives=[], arbiters=[], primary='ip-172-19-33-146.eu-central-1.compute.internal:27017', tagSet=TagSet{[]}, electionId=null, setVersion=9, lastWriteDate=Wed Apr 13 17:56:53 UTC 2022, lastUpdateTimeNanos=2870318643377297}
22/04/13 17:56:56 INFO cluster: Adding discovered server ip-172-19-33-146.eu-central-1.compute.internal:27017 to client view of cluster
22/04/13 17:56:56 INFO cluster: Adding discovered server ip-172-19-46-161.eu-central-1.compute.internal:27017 to client view of cluster
22/04/13 17:56:56 INFO connection: Opened connection [connectionId{localValue:3, serverValue:199558}] to ip-172-19-33-146.eu-central-1.compute.internal:27017
22/04/13 17:56:56 INFO cluster: Server rs1-prd.dp.aws.customer.internal:27017 is no longer a member of the replica set. Removing from client view of cluster.
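The discovery sequence above can be modeled in a few lines: the client is seeded with the SRV names (rs1-prd, rs2-prd), each monitor handshake reports the server's canonicalAddress, and a seed whose canonical address differs is dropped in favor of the addresses advertised by the replica-set config. This is only a toy illustration of the driver behaviour visible in the log; the function and variable names are invented for the sketch and are not part of num_conversion.py or the MongoDB driver.

```python
def reconcile_view(view, address, canonical_address, peer_hosts):
    """Return an updated client view after one monitor handshake (toy model)."""
    updated = set(view)
    # Hosts advertised in the replica-set config are discovered and added.
    updated.update(peer_hosts)
    if canonical_address != address:
        # "Canonical address ... does not match server address. Removing ..."
        updated.discard(address)
    return updated

seeds = {
    "rs1-prd.dp.aws.customer.internal:27017",
    "rs2-prd.dp.aws.customer.internal:27017",
}
peers = {
    "ip-172-19-33-146.eu-central-1.compute.internal:27017",
    "ip-172-19-46-161.eu-central-1.compute.internal:27017",
}
view = reconcile_view(seeds, "rs1-prd.dp.aws.customer.internal:27017",
                      "ip-172-19-33-146.eu-central-1.compute.internal:27017", peers)
view = reconcile_view(view, "rs2-prd.dp.aws.customer.internal:27017",
                      "ip-172-19-46-161.eu-central-1.compute.internal:27017", peers)
print(sorted(view))  # only the two canonical EC2 addresses remain
```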
22/04/13 17:56:56 INFO cluster: Monitor thread successfully connected to server with description ServerDescription{address=ip-172-19-33-146.eu-central-1.compute.internal:27017, type=REPLICA_SET_PRIMARY, state=CONNECTED, ok=true, minWireVersion=0, maxWireVersion=13, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=2151240, setName='prd', canonicalAddress=ip-172-19-33-146.eu-central-1.compute.internal:27017, hosts=[ip-172-19-33-146.eu-central-1.compute.internal:27017, ip-172-19-46-161.eu-central-1.compute.internal:27017], passives=[], arbiters=[], primary='ip-172-19-33-146.eu-central-1.compute.internal:27017', tagSet=TagSet{[]}, electionId=7fffffff000000000000000f, setVersion=9, lastWriteDate=Wed Apr 13 17:56:53 UTC 2022, lastUpdateTimeNanos=2870318670642378}
22/04/13 17:56:56 INFO cluster: Server rs2-prd.dp.aws.customer.internal:27017 is no longer a member of the replica set. Removing from client view of cluster.
22/04/13 17:56:56 INFO connection: Opened connection [connectionId{localValue:4, serverValue:196736}] to ip-172-19-46-161.eu-central-1.compute.internal:27017
22/04/13 17:56:56 INFO cluster: Canonical address ip-172-19-33-146.eu-central-1.compute.internal:27017 does not match server address. Removing rs1-prd.dp.aws.customer.internal:27017 from client view of cluster
22/04/13 17:56:56 INFO cluster: Setting max election id to 7fffffff000000000000000f from replica set primary ip-172-19-33-146.eu-central-1.compute.internal:27017
22/04/13 17:56:56 INFO cluster: Setting max set version to 9 from replica set primary ip-172-19-33-146.eu-central-1.compute.internal:27017
22/04/13 17:56:56 INFO cluster: Discovered replica set primary ip-172-19-33-146.eu-central-1.compute.internal:27017
22/04/13 17:56:56 INFO cluster: Monitor thread successfully connected to server with description ServerDescription{address=ip-172-19-46-161.eu-central-1.compute.internal:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, minWireVersion=0, maxWireVersion=13, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=2118074, setName='prd', canonicalAddress=ip-172-19-46-161.eu-central-1.compute.internal:27017, hosts=[ip-172-19-33-146.eu-central-1.compute.internal:27017, ip-172-19-46-161.eu-central-1.compute.internal:27017], passives=[], arbiters=[], primary='ip-172-19-33-146.eu-central-1.compute.internal:27017', tagSet=TagSet{[]}, electionId=null, setVersion=9, lastWriteDate=Wed Apr 13 17:56:53 UTC 2022, lastUpdateTimeNanos=2870318677533116}
22/04/13 17:56:57 INFO connection: Opened connection [connectionId{localValue:5, serverValue:199559}] to ip-172-19-33-146.eu-central-1.compute.internal:27017
22/04/13 17:56:57 INFO SparkContext: Starting job: treeAggregate at MongoInferSchema.scala:88
22/04/13 17:56:57 INFO DAGScheduler: Got job 0 (treeAggregate at MongoInferSchema.scala:88) with 1 output partitions
22/04/13 17:56:57 INFO DAGScheduler: Final stage: ResultStage 0 (treeAggregate at MongoInferSchema.scala:88)
22/04/13 17:56:57 INFO DAGScheduler: Parents of final stage: List()
22/04/13 17:56:57 INFO DAGScheduler: Missing parents: List()
22/04/13 17:56:57 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[4] at treeAggregate at MongoInferSchema.scala:88), which has no missing parents
22/04/13 17:56:57 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 7.3 KiB, free 434.4 MiB)
22/04/13 17:56:57 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.6 KiB, free 434.4 MiB)
22/04/13 17:56:57 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 5bcbeccf03ef:42707 (size: 3.6 KiB, free: 434.4 MiB)
22/04/13 17:56:57 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1388
22/04/13 17:56:57 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[4] at treeAggregate at MongoInferSchema.scala:88) (first 15 tasks are for partitions Vector(0))
22/04/13 17:56:57 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks resource profile 0
22/04/13 17:56:57 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (5bcbeccf03ef, executor driver, partition 0, ANY, 4610 bytes) taskResourceAssignments Map()
22/04/13 17:56:57 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
22/04/13 17:57:00 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 7801 bytes result sent to driver
22/04/13 17:57:00 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 3175 ms on 5bcbeccf03ef (executor driver) (1/1)
22/04/13 17:57:00 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
22/04/13 17:57:00 INFO DAGScheduler: ResultStage 0 (treeAggregate at MongoInferSchema.scala:88) finished in 3.292 s
22/04/13 17:57:00 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
22/04/13 17:57:00 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
22/04/13 17:57:00 INFO DAGScheduler: Job 0 finished: treeAggregate at MongoInferSchema.scala:88, took 3.369280 s
WARNING:root:Using filer clause: CDC_TIMESTAMP >= "1649841014867"
22/04/13 17:57:02 INFO MongoRelation: requiredColumns: CDC_TIMESTAMP, filters: IsNotNull(CDC_TIMESTAMP), GreaterThanOrEqual(CDC_TIMESTAMP,1649841014867)
22/04/13 17:57:03 INFO CodeGenerator: Code generated in 242.510366 ms
22/04/13 17:57:03 INFO CodeGenerator: Code generated in 49.641356 ms
22/04/13 17:57:03 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 5bcbeccf03ef:42707 in memory (size: 3.6 KiB, free: 434.4 MiB)
22/04/13 17:57:03 INFO SparkContext: Starting job: count at NativeMethodAccessorImpl.java:0
22/04/13 17:57:03 INFO DAGScheduler: Registering RDD 9 (count at NativeMethodAccessorImpl.java:0) as input to shuffle 0
22/04/13 17:57:03 INFO DAGScheduler: Got job 1 (count at NativeMethodAccessorImpl.java:0) with 1 output partitions
22/04/13 17:57:03 INFO DAGScheduler: Final stage: ResultStage 2 (count at NativeMethodAccessorImpl.java:0)
22/04/13 17:57:03 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
22/04/13 17:57:03 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
22/04/13 17:57:03 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[9] at count at NativeMethodAccessorImpl.java:0), which has no missing parents
22/04/13 17:57:03 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 24.2 KiB, free 434.4 MiB)
22/04/13 17:57:03 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 9.8 KiB, free 434.4 MiB)
22/04/13 17:57:03 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 5bcbeccf03ef:42707 (size: 9.8 KiB, free: 434.4 MiB)
22/04/13 17:57:03 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1388
22/04/13 17:57:03 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[9] at count at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0))
22/04/13 17:57:03 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks resource profile 0
22/04/13 17:57:03 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1) (5bcbeccf03ef, executor driver, partition 0, ANY, 4599 bytes) taskResourceAssignments Map()
22/04/13 17:57:03 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
22/04/13 17:57:03 INFO CodeGenerator: Code generated in 24.206149 ms
22/04/13 17:57:03 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1941 bytes result sent to driver
22/04/13 17:57:03 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 267 ms on 5bcbeccf03ef (executor driver) (1/1)
22/04/13 17:57:03 INFO DAGScheduler: ShuffleMapStage 1 (count at NativeMethodAccessorImpl.java:0) finished in 0.292 s
22/04/13 17:57:03 INFO DAGScheduler: looking for newly runnable stages
22/04/13 17:57:03 INFO DAGScheduler: running: Set()
22/04/13 17:57:03 INFO DAGScheduler: waiting: Set(ResultStage 2)
22/04/13 17:57:03 INFO DAGScheduler: failed: Set()
22/04/13 17:57:03 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[12] at count at NativeMethodAccessorImpl.java:0), which has no missing parents
22/04/13 17:57:03 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
22/04/13 17:57:03 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 10.1 KiB, free 434.4 MiB)
22/04/13 17:57:03 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 5.0 KiB, free 434.4 MiB)
22/04/13 17:57:03 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 5bcbeccf03ef:42707 (size: 5.0 KiB, free: 434.4 MiB)
22/04/13 17:57:03 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1388
22/04/13 17:57:03 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[12] at count at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0))
22/04/13 17:57:03 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks resource profile 0
22/04/13 17:57:03 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2) (5bcbeccf03ef, executor driver, partition 0, NODE_LOCAL, 4453 bytes) taskResourceAssignments Map()
22/04/13 17:57:03 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
22/04/13 17:57:04 INFO ShuffleBlockFetcherIterator: Getting 1 (60.0 B) non-empty blocks including 1 (60.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) remote blocks
22/04/13 17:57:04 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 8 ms
22/04/13 17:57:04 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 2648 bytes result sent to driver
22/04/13 17:57:04 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 66 ms on 5bcbeccf03ef (executor driver) (1/1)
22/04/13 17:57:04 INFO DAGScheduler: ResultStage 2 (count at NativeMethodAccessorImpl.java:0) finished in 0.079 s
22/04/13 17:57:04 INFO DAGScheduler: Job 1 is finished. Cancelling potential speculative or zombie tasks for this job
22/04/13 17:57:04 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
22/04/13 17:57:04 INFO TaskSchedulerImpl: Killing all running tasks in stage 2: Stage finished
22/04/13 17:57:04 INFO DAGScheduler: Job 1 finished: count at NativeMethodAccessorImpl.java:0, took 0.426925 s
WARNING:root:The data dataframe count after filter on timestamp: 646
22/04/13 17:57:04 INFO MongoRelation: requiredColumns: CDC_TIMESTAMP, ENV, PK_ID, PROFILE, VAL0000, VAL0015, VAL0030, VAL0045, VAL0100, VAL0115, VAL0130, VAL0145, VAL0200, VAL0215, VAL0230, VAL0245, VAL0300, VAL0315, VAL0330, VAL0345, VAL0400, VAL0415, VAL0430, VAL0445, VAL0500, VAL0515, VAL0530, VAL0545, VAL0600, VAL0615, VAL0630, VAL0645, VAL0700, VAL0715, VAL0730, VAL0745, VAL0800, VAL0815, VAL0830, VAL0845, VAL0900, VAL0915, VAL0930, VAL0945, VAL1000, VAL1015, VAL1030, VAL1045, VAL1100, VAL1115, VAL1130, VAL1145, VAL1200, VAL1215, VAL1230, VAL1245, VAL1300, VAL1315, VAL1330, VAL1345, VAL1400, VAL1415, VAL1430, VAL1445, VAL1500, VAL1515, VAL1530, VAL1545, VAL1600, VAL1615, VAL1630, VAL1645, VAL1700, VAL1715, VAL1730, VAL1745, VAL1800, VAL1815, VAL1830, VAL1845, VAL1900, VAL1915, VAL1930, VAL1945, VAL2000, VAL2015, VAL2030, VAL2045, VAL2100, VAL2115, VAL2130, VAL2145, VAL2200, VAL2215, VAL2230, VAL2245, VAL2300, VAL2315, VAL2330, VAL2345, VALUEDAY, _insertedTS, _modifiedTS, filters: IsNotNull(CDC_TIMESTAMP), GreaterThanOrEqual(CDC_TIMESTAMP,1649841014867)
22/04/13 17:57:04 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
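The requiredColumns list above carries 96 VALhhmm columns, one per quarter-hour of the day (VAL0000 through VAL2345), plus VALUEDAY and the CDC metadata columns. A quick sketch (inferred from the column names in the log; the real num_conversion.py may build them differently) regenerates the slot names:

```python
# Regenerate the 96 quarter-hour column names seen in requiredColumns.
slot_columns = [f"VAL{hour:02d}{minute:02d}"
                for hour in range(24)
                for minute in (0, 15, 30, 45)]

print(len(slot_columns))                  # 96 slots = 24 hours x 4 quarter-hours
print(slot_columns[0], slot_columns[-1])  # VAL0000 VAL2345
```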
22/04/13 17:57:04 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
22/04/13 17:57:04 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
22/04/13 17:57:04 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
22/04/13 17:57:04 INFO SparkContext: Starting job: csv at NativeMethodAccessorImpl.java:0
22/04/13 17:57:04 INFO DAGScheduler: Got job 2 (csv at NativeMethodAccessorImpl.java:0) with 1 output partitions
22/04/13 17:57:04 INFO DAGScheduler: Final stage: ResultStage 3 (csv at NativeMethodAccessorImpl.java:0)
22/04/13 17:57:04 INFO DAGScheduler: Parents of final stage: List()
22/04/13 17:57:04 INFO DAGScheduler: Missing parents: List()
22/04/13 17:57:04 INFO DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[17] at csv at NativeMethodAccessorImpl.java:0), which has no missing parents
22/04/13 17:57:04 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 295.7 KiB, free 434.1 MiB)
22/04/13 17:57:04 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 82.0 KiB, free 434.0 MiB)
22/04/13 17:57:04 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 5bcbeccf03ef:42707 (size: 82.0 KiB, free: 434.3 MiB)
22/04/13 17:57:04 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1388
22/04/13 17:57:04 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[17] at csv at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0))
22/04/13 17:57:04 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks resource profile 0
22/04/13 17:57:04 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3) (5bcbeccf03ef, executor driver, partition 0, ANY, 4610 bytes) taskResourceAssignments Map()
22/04/13 17:57:04 INFO Executor: Running task 0.0 in stage 3.0 (TID 3)
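For reference, the CDC_TIMESTAMP cutoff 1649841014867 applied by the filter above is epoch milliseconds. Converting it (plain Python, unrelated to the script itself) places the watermark on the morning of the run day:

```python
from datetime import datetime, timedelta, timezone

# CDC_TIMESTAMP values are epoch milliseconds; convert the filter cutoff
# with exact integer arithmetic to avoid float rounding.
cutoff_ms = 1649841014867
cutoff = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=cutoff_ms)
print(cutoff.isoformat())  # 2022-04-13T09:10:14.867000+00:00
```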
22/04/13 17:57:04 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 5bcbeccf03ef:42707 in memory (size: 5.0 KiB, free: 434.3 MiB)
22/04/13 17:57:04 INFO CodeGenerator: Code generated in 193.118221 ms
22/04/13 17:57:04 INFO CodeGenerator: Code generated in 7.625605 ms
22/04/13 17:57:04 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
22/04/13 17:57:04 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
22/04/13 17:57:04 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 5bcbeccf03ef:42707 in memory (size: 9.8 KiB, free: 434.3 MiB)
22/04/13 17:57:04 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
22/04/13 17:57:05 INFO CodeGenerator: Code generated in 379.158446 ms
22/04/13 17:57:06 INFO FileOutputCommitter: Saved output of task 'attempt_202204131757047294744952994687634_0003_m_000000_3' to file:/opt/prd_data_to_validate.csv/_temporary/0/task_202204131757047294744952994687634_0003_m_000000
22/04/13 17:57:06 INFO SparkHadoopMapRedUtil: attempt_202204131757047294744952994687634_0003_m_000000_3: Committed
22/04/13 17:57:06 INFO Executor: Finished task 0.0 in stage 3.0 (TID 3). 2429 bytes result sent to driver
22/04/13 17:57:06 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 1622 ms on 5bcbeccf03ef (executor driver) (1/1)
22/04/13 17:57:06 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
22/04/13 17:57:06 INFO DAGScheduler: ResultStage 3 (csv at NativeMethodAccessorImpl.java:0) finished in 1.665 s
22/04/13 17:57:06 INFO DAGScheduler: Job 2 is finished. Cancelling potential speculative or zombie tasks for this job
22/04/13 17:57:06 INFO TaskSchedulerImpl: Killing all running tasks in stage 3: Stage finished
22/04/13 17:57:06 INFO DAGScheduler: Job 2 finished: csv at NativeMethodAccessorImpl.java:0, took 1.671971 s
22/04/13 17:57:06 INFO FileFormatWriter: Write Job 2f374af1-3b72-4922-8dbf-2aa9dba72be9 committed.
22/04/13 17:57:06 INFO FileFormatWriter: Finished processing stats for write job 2f374af1-3b72-4922-8dbf-2aa9dba72be9.
WARNING:root:Done
WARNING:root:Using filer clause: PK_ID == "04000000000000000000120220402"
22/04/13 17:57:06 INFO MongoRelation: requiredColumns: CDC_TIMESTAMP, ENV, PK_ID, PROFILE, VAL0000, VAL0015, VAL0030, VAL0045, VAL0100, VAL0115, VAL0130, VAL0145, VAL0200, VAL0215, VAL0230, VAL0245, VAL0300, VAL0315, VAL0330, VAL0345, VAL0400, VAL0415, VAL0430, VAL0445, VAL0500, VAL0515, VAL0530, VAL0545, VAL0600, VAL0615, VAL0630, VAL0645, VAL0700, VAL0715, VAL0730, VAL0745, VAL0800, VAL0815, VAL0830, VAL0845, VAL0900, VAL0915, VAL0930, VAL0945, VAL1000, VAL1015, VAL1030, VAL1045, VAL1100, VAL1115, VAL1130, VAL1145, VAL1200, VAL1215, VAL1230, VAL1245, VAL1300, VAL1315, VAL1330, VAL1345, VAL1400, VAL1415, VAL1430, VAL1445, VAL1500, VAL1515, VAL1530, VAL1545, VAL1600, VAL1615, VAL1630, VAL1645, VAL1700, VAL1715, VAL1730, VAL1745, VAL1800, VAL1815, VAL1830, VAL1845, VAL1900, VAL1915, VAL1930, VAL1945, VAL2000, VAL2015, VAL2030, VAL2045, VAL2100, VAL2115, VAL2130, VAL2145, VAL2200, VAL2215, VAL2230, VAL2245, VAL2300, VAL2315, VAL2330, VAL2345, VALUEDAY, _id, _insertedTS, _modifiedTS, filters: IsNotNull(PK_ID), EqualTo(PK_ID,04000000000000000000120220402)
22/04/13 17:57:06 INFO SparkContext: Starting job: toPandas at num_conversion.py:96
22/04/13 17:57:06 INFO DAGScheduler: Got job 3 (toPandas at num_conversion.py:96) with 1 output partitions
22/04/13 17:57:06 INFO DAGScheduler: Final stage: ResultStage 4 (toPandas at num_conversion.py:96)
22/04/13 17:57:06 INFO DAGScheduler: Parents of final stage: List()
22/04/13 17:57:06 INFO DAGScheduler: Missing parents: List()
22/04/13 17:57:06 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[25] at toPandas at num_conversion.py:96), which has no missing parents
22/04/13 17:57:06 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 129.1 KiB, free 433.9 MiB)
22/04/13 17:57:06 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 23.3 KiB, free 433.9 MiB)
22/04/13 17:57:06 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 5bcbeccf03ef:42707 (size: 23.3 KiB, free: 434.3 MiB)
22/04/13 17:57:06 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1388
22/04/13 17:57:06 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[25] at toPandas at num_conversion.py:96) (first 15 tasks are for partitions Vector(0))
22/04/13 17:57:06 INFO TaskSchedulerImpl: Adding task set 4.0 with 1 tasks resource profile 0
22/04/13 17:57:06 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 4) (5bcbeccf03ef, executor driver, partition 0, ANY, 4610 bytes) taskResourceAssignments Map()
22/04/13 17:57:06 INFO Executor: Running task 0.0 in stage 4.0 (TID 4)
22/04/13 17:57:06 INFO CodeGenerator: Code generated in 48.651706 ms
22/04/13 17:57:06 INFO CodeGenerator: Code generated in 5.826723 ms
22/04/13 17:57:06 INFO CodeGenerator: Code generated in 226.212641 ms
22/04/13 17:57:06 INFO Executor: Finished task 0.0 in stage 4.0 (TID 4). 1563 bytes result sent to driver
22/04/13 17:57:06 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 4) in 500 ms on 5bcbeccf03ef (executor driver) (1/1)
22/04/13 17:57:06 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
22/04/13 17:57:06 INFO DAGScheduler: ResultStage 4 (toPandas at num_conversion.py:96) finished in 0.525 s
22/04/13 17:57:06 INFO DAGScheduler: Job 3 is finished. Cancelling potential speculative or zombie tasks for this job
22/04/13 17:57:06 INFO TaskSchedulerImpl: Killing all running tasks in stage 4: Stage finished
22/04/13 17:57:06 INFO DAGScheduler: Job 3 finished: toPandas at num_conversion.py:96, took 0.531992 s
/usr/local/lib/python3.7/site-packages/pyspark/sql/pandas/conversion.py:186: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  df[column_name] = series
+----+-----------------+---------+-------------------------------+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+--------------------------------------------------------------+----------------------------+----------------------------+
|    | CDC_TIMESTAMP | ENV | PK_ID | PROFILE | VAL0000 | VAL0015 | VAL0030 | VAL0045 | VAL0100 | VAL0115 | VAL0130 | VAL0145 | VAL0200 | VAL0215 | VAL0230 | VAL0245 | VAL0300 | VAL0315 | VAL0330 | VAL0345 | VAL0400 | VAL0415 | VAL0430 | VAL0445 | VAL0500 | VAL0515 | VAL0530 | VAL0545 | VAL0600 | VAL0615 | VAL0630 | VAL0645 | VAL0700 | VAL0715 | VAL0730 | VAL0745 | VAL0800 | VAL0815 | VAL0830 | VAL0845 | VAL0900 | VAL0915 | VAL0930 | VAL0945 | VAL1000 | VAL1015 | VAL1030 | VAL1045 | VAL1100 | VAL1115 | VAL1130 | VAL1145 | VAL1200 | VAL1215 | VAL1230 | VAL1245 | VAL1300 | VAL1315 | VAL1330 | VAL1345 | VAL1400 | VAL1415 | VAL1430 | VAL1445 | VAL1500 | VAL1515 | VAL1530 | VAL1545 | VAL1600 | VAL1615 | VAL1630 | VAL1645 | VAL1700 | VAL1715 | VAL1730 | VAL1745 | VAL1800 | VAL1815 | VAL1830 | VAL1845 | VAL1900 | VAL1915 | VAL1930 | VAL1945 | VAL2000 | VAL2015 | VAL2030 | VAL2045 | VAL2100 | VAL2115 | VAL2130 | VAL2145 | VAL2200 | VAL2215 | VAL2230 | VAL2245 | VAL2300 | VAL2315 | VAL2330 | VAL2345 | VALUEDAY | _id | _insertedTS | _modifiedTS |
|----+-----------------+---------+-------------------------------+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+--------------------------------------------------------------+----------------------------+----------------------------|
|  0 | 1649841751350 | 040 | 04000000000000000000120220402 | 000000000000000001 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 20220402 | Row(PK_ID='04000000000000000000120220402', oid=None) | 2022-04-04 06:16:09.979000 | 2022-04-13 07:22:34.990000 |
+----+-----------------+---------+-------------------------------+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+--------------------------------------------------------------+----------------------------+----------------------------+
22/04/13 17:57:07 INFO MongoClientCache: Closing MongoClient: [ip-172-19-46-161.eu-central-1.compute.internal:27017,ip-172-19-33-146.eu-central-1.compute.internal:27017]
22/04/13 17:57:07 INFO connection: Closed connection [connectionId{localValue:5, serverValue:199559}] to ip-172-19-33-146.eu-central-1.compute.internal:27017 because the pool has been closed.
22/04/13 17:57:07 INFO SparkContext: Invoking stop() from shutdown hook
bash-4.2# 22/04/13 17:57:07 INFO SparkUI: Stopped Spark web UI at http://5bcbeccf03ef:4040
22/04/13 17:57:07 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/04/13 17:57:07 INFO MemoryStore: MemoryStore cleared
22/04/13 17:57:07 INFO BlockManager: BlockManager stopped
22/04/13 17:57:07 INFO BlockManagerMaster: BlockManagerMaster stopped
22/04/13 17:57:07 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/04/13 17:57:07 INFO SparkContext: Successfully stopped SparkContext
22/04/13 17:57:07 INFO ShutdownHookManager: Shutdown hook called
22/04/13 17:57:07 INFO ShutdownHookManager: Deleting directory /tmp/spark-72f58a2e-1c1f-41be-9f56-62caad65fa4b
22/04/13 17:57:07 INFO ShutdownHookManager: Deleting directory /tmp/spark-6735e371-7a63-45e2-8a69-3ed40007be58/pyspark-af0c2558-2a00-4355-a2ca-68c8bcdedb2c
22/04/13 17:57:07 INFO ShutdownHookManager: Deleting directory /tmp/spark-6735e371-7a63-45e2-8a69-3ed40007be58
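One detail worth noting from the single row returned by the PK_ID lookup: the row shows ENV='040' and VALUEDAY='20220402' alongside PK_ID='04000000000000000000120220402', so the key appears to embed the ENV code as a prefix and the VALUEDAY date as its last eight digits. This is an inference from that one row, not a documented key format:

```python
# Inferred (not documented) decomposition of the PK_ID used in the lookup.
pk_id = "04000000000000000000120220402"

env = pk_id[:3]         # matches the ENV column of the returned row
value_day = pk_id[-8:]  # matches the VALUEDAY column (YYYYMMDD)

print(env, value_day)   # 040 20220402
```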