Spark - SW Developer

[Spark] Timezone Settings for Globalization, TIMESTAMP_NTZ

G11N / Timestamp handling issues by file format (Parquet, Avro, ORC) / Timestamp issues in Iceberg

Posted by Wonyong Jang on March 15, 2025 · 9 mins read

[Spark] Saving Tables in Spark

Comparing save and saveAsTable / writeTo, insertInto

Posted by Wonyong Jang on January 24, 2025 · 3 mins read

[Spark] Working with Iceberg Tables in Spark

Creating tables (create) / partitionOverwriteMode, storeAssignmentPolicy (ANSI, legacy, strict) / insert overwrite and merge into

Posted by Wonyong Jang on October 09, 2024 · 10 mins read

[Spark] Setting Up a PySpark Development Environment and Key Features

Comparing Spark with Scala and Python / Temp View / Python Package Management / spark-submit options

Posted by Wonyong Jang on August 08, 2024 · 11 mins read

[Spark] Dynamic Partition Pruning / Speculative Execution

filter pushdown / Optimizing query performance when joining dimension and fact tables

Posted by Wonyong Jang on May 15, 2024 · 5 mins read

[Spark] Join Strategies and Shuffle

shuffle join, broadcast join / shuffle sort merge join, broadcast hash join / join hint

Posted by Wonyong Jang on April 20, 2024 · 13 mins read

[Spark] Adaptive Query Execution

Broadcast Hash Join / coalescing shuffle partitions, switching join strategies, optimizing skew joins

Posted by Wonyong Jang on April 15, 2024 · 12 mins read

[Spark] Salting Technique for Resolving Data Skew

Distributing data evenly

Posted by Wonyong Jang on April 10, 2024 · 6 mins read

[Spark] On Kubernetes

Comparison with Spark on an EMR Cluster / EKS (Elastic Kubernetes Service)

Posted by Wonyong Jang on March 03, 2024 · 4 mins read

[Spark] Memory Management and Tuning

Choosing an appropriate number of Drivers and Executors when running Spark / on-heap, off-heap, overhead memory / Memory in PySpark

Posted by Wonyong Jang on February 13, 2024 · 13 mins read

[Spark] Log Rolling with Log4j (RollingFileAppender)

Using a custom Log4j / Log rolling in long-running Spark Streaming

Posted by Wonyong Jang on November 19, 2023 · 3 mins read

[Spark] Writing Test Code

Unit testing with the scalatest and spark-testing-base libraries (RDD, DataFrame, Dataset)

Posted by Wonyong Jang on September 29, 2023 · 7 mins read

[Spark] Spark streaming processing delay (Incident Review)

Monitor Spark streaming applications on Amazon EMR / StreamingListener

Posted by Wonyong Jang on July 09, 2023 · 14 mins read

[Spark ML] Implementing the LightGBM Algorithm with Spark

Key hyperparameters / Early Stopping

Posted by Wonyong Jang on April 08, 2023 · 4 mins read

[Spark ML] Spark ML Decision Trees

DecisionTreeClassifier, RandomForestClassifier, GBTClassifier / MulticlassClassificationEvaluator, BinaryClassificationEvaluator

Posted by Wonyong Jang on April 05, 2023 · 9 mins read

[Spark ML] Spark ML Data Preprocessing

Label Encoding(StringIndexer, IndexToString) / OneHotEncoderEstimator / Scaling(StandardScaler, MinMaxScaler)

Posted by Wonyong Jang on April 02, 2023 · 19 mins read

[Spark ML] Building a Prediction Model for the Iris Dataset with Spark ML

randomSplit, vectorAssembler, pipeline / Cross-validation and hyperparameter tuning with crossValidator and trainValidationSplit

Posted by Wonyong Jang on April 01, 2023 · 13 mins read

[Spark] Installation and Setting Up a Practice Environment

Script for launching the Scala Spark prompt / Running Spark with Docker / Databricks platform Community Edition

Posted by Wonyong Jang on January 27, 2023 · 8 mins read

[Spark] Migrating to Structured Streaming

Migrating from Spark Streaming to Structured Streaming / Integrating Structured Streaming with Kinesis / checkpoint and initialPosition / Troubleshooting

Posted by Wonyong Jang on March 07, 2022 · 15 mins read

[Spark] Implementing Word Count with Structured Streaming

Watermarking in append mode and update mode / Handling late data

Posted by Wonyong Jang on January 07, 2022 · 7 mins read

[Spark] Structured Streaming Fault Tolerance

Planner, Source, State, Sink

Posted by Wonyong Jang on January 05, 2022 · 3 mins read

[Spark] What Is Structured Streaming

Comparison with Spark Streaming and its problems in use / Unbounded Table / OutputMode (Complete, Update, Append)

Posted by Wonyong Jang on January 03, 2022 · 7 mins read

[Spark] Building a Spark Practice Environment with Docker Ubuntu Containers

Setting up a master and worker cluster environment with Docker / spark-submit / Standalone cluster manager

Posted by Wonyong Jang on August 29, 2021 · 14 mins read

[Spark] Broadcast and Accumulator Shared Variables

broadcast, accumulator, closure

Posted by Wonyong Jang on July 08, 2021 · 4 mins read

[Spark] How to override a spark dependency in cluster mode(AWS EMR)

Relocating packages with shadowJar when library version conflicts occur

Posted by Wonyong Jang on July 08, 2021 · 5 mins read

[Spark] Dynamic Resource Allocation in AWS EMR Cluster

Spark (Streaming) Dynamic Allocation / External Shuffle Service / Troubleshooting EC2-based AWS EMR auto scaling

Posted by Wonyong Jang on June 25, 2021 · 13 mins read

[Spark] Persistence and Data Locality

RDD Persistence / memory, disk cache / locality level (PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL)

Posted by Wonyong Jang on June 23, 2021 · 11 mins read

[Spark] Apache Spark Partitioning

RDD on a Cluster / Choosing the number and size of partitions / coalesce and repartition / spark.sql.files.maxPartitionBytes

Posted by Wonyong Jang on June 21, 2021 · 9 mins read

[Spark] Apache Spark Serialization

Serialization challenges with Spark and Scala / Passing functions to Spark

Posted by Wonyong Jang on June 15, 2021 · 16 mins read

[Spark] Pipeline and Stage

When stages are skipped / Stage boundaries caused by shuffles / Shuffle write and read

Posted by Wonyong Jang on May 10, 2021 · 6 mins read

[Spark] Apache Spark Dataset

How to use the main Dataset operations / Encoder

Posted by Wonyong Jang on May 07, 2021 · 11 mins read

[Spark] Apache Spark SQL's Tungsten Project

Hardware-level optimization (CPU, memory, etc.) provided when using Spark SQL

Posted by Wonyong Jang on May 04, 2021 · 10 mins read

[Spark] Apache Spark SQL's Catalyst Optimizer

Engine-level performance optimization when using Spark SQL / Optimized Query Plan

Posted by Wonyong Jang on May 03, 2021 · 7 mins read

[Spark] Implementing Apache Spark DataFrames

Main DataFrame operations / groupBy / UDF (User-Defined Function) / join

Posted by Wonyong Jang on May 02, 2021 · 16 mins read

[Spark] Apache Spark SQL and DataFrame

RDD vs DataFrame / Catalyst Optimizer / Tungsten execution engine / Encoder

Posted by Wonyong Jang on May 01, 2021 · 5 mins read

[Spark] Streaming Graceful Shutdown

How to gracefully shut down a Spark Streaming job / Differences between SIGKILL, SIGTERM, and SIGINT

Posted by Wonyong Jang on April 19, 2021 · 9 mins read

[Spark] (Structured) Streaming Checkpointing

Checkpoints in Spark Streaming and Structured Streaming, implementation using S3 as the checkpoint location (AWS credentials) / S3A and EMRFS

Posted by Wonyong Jang on April 17, 2021 · 10 mins read

[Spark] Streaming Data Sources

Kafka, Kinesis / Direct, Receiver Based Data Sources, Fault Tolerance / Backpressure

Posted by Wonyong Jang on April 15, 2021 · 7 mins read

[Spark] Streaming Fault Tolerance and Graph

Failure recovery / The DStream Graph / Receiver, Network Input Tracker, Job Scheduler, Job Manager, Block Manager

Posted by Wonyong Jang on April 14, 2021 · 8 mins read

[Spark] Streaming DStream and Its Main Operations

DStream(Discretized Streams) / stateful(window, state) / transform / Receiver Input Stream / Block Manager

Posted by Wonyong Jang on April 12, 2021 · 18 mins read

[Spark] Various Apache Spark RDD Operations

Lazily evaluated Transformations, eagerly executed Actions / narrow and wide transformations

Posted by Wonyong Jang on April 12, 2021 · 22 mins read

[Spark] Getting Started with Apache Spark

Driver, Executor, Node, Job, Stage, Task, Cluster Manager / RDD, Fault tolerance / Hadoop

Posted by Wonyong Jang on April 11, 2021 · 16 mins read