You will design and build scalable data infrastructure to handle complex data workflows.
Responsibilities
Define data requirements, gather and wrangle large volumes of structured and unstructured data, and validate that data using the tools available in the Data Environment.
Develop mechanisms to ingest, analyze, validate, normalize, and clean data, supporting ad-hoc analysis and standardization.
Create data policies and develop interfaces and retention models, including approaches for synthesizing or anonymizing data.
Implement statistical data quality procedures on new data sources to support Data Scientists and insight creation.
Build and maintain data pipelines that clean, transform, and aggregate data from disparate sources.
Required Skills
7+ years of overall IT experience.
5+ years in a data engineering/ETL role manipulating and processing large datasets.
3+ years with Big Data tools like Hadoop, Spark, Spark SQL, Kafka, Sqoop, Hive, S3, HDFS, or Cloud platforms (AWS, GCP).
3+ years building, testing, and optimizing data ingestion pipelines using TIBCO, IBM, or similar technologies.
Experience with the Databricks UI, managing Databricks Notebooks, Delta Lake (Python/Spark SQL), Delta Live Tables, and Unity Catalog.
Experience with high-velocity, high-volume stream processing using Apache Kafka and Spark Streaming.