<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Arun Shanmugam Kumar - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Arun Shanmugam Kumar - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sun, 24 May 2026 22:23:57 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/arunshanmugamkumar/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Read and Write Deeply Partitioned Files Using Apache Spark ]]>
                </title>
                <description>
                    <![CDATA[ If you use Apache Spark to write your data pipeline, you might need to export or copy data from a source to destination while preserving the partition folders between the source and destination. When I researched online on how to do this in Spark, I ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-read-and-write-deeply-partitioned-files-using-apache-spark/</link>
                <guid isPermaLink="false">68b4bd4bbcba09cb447c5a9f</guid>
                
                    <category>
                        <![CDATA[ #apache-spark ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Scala ]]>
                    </category>
                
                    <category>
                        <![CDATA[ big data ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Arun Shanmugam Kumar ]]>
                </dc:creator>
                <pubDate>Sun, 31 Aug 2025 21:23:23 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756671369152/1e620925-6fb6-47fa-8344-86b9d3c7cd02.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you use Apache Spark to write your data pipeline, you might need to export or copy data from a source to a destination while preserving the partition folders between the two.</p>
<p>When I searched online for how to do this in Spark, I found very few tutorials with an end-to-end solution that worked, especially when the partitions are deeply nested and you don't know beforehand the values these folder names will take (for example <code>year=*/month=*/day=*/hour=*/*.csv</code>).</p>
<p>In this tutorial, I’ll walk through one such implementation using Spark.</p>
<h3 id="heading-heres-what-well-cover">Here’s what we’ll cover:</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisite">Prerequisite</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-setup">Setup</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-false-starts">False Starts</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-my-solution">My Solution</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisite">Prerequisite</h2>
<p>To follow along with this tutorial, you need a basic understanding of distributed computing with frameworks like Hadoop and Spark, and you should be comfortable reading code written in an object-oriented language like Scala or Java. The code was tested with the following dependencies:</p>
<ul>
<li><p>Scala 2.12+</p>
</li>
<li><p>Java 17 (earlier versions might work)</p>
</li>
<li><p>sbt</p>
</li>
</ul>
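<p>If you build the example with sbt, a minimal <code>build.sbt</code> could look like the sketch below. The project name and the exact Scala and Spark versions are illustrative assumptions, not necessarily the ones the code in this tutorial was tested with, so adjust them to match your environment.</p>
<pre><code class="lang-scala">// build.sbt (illustrative sketch; the name and versions are assumptions)
name := "partitioned-reader-writer"
version := "0.1.0"
scalaVersion := "2.12.18"

// Marked "provided" on the assumption that spark-submit supplies Spark at runtime.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1" % "provided"
</code></pre>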
<h2 id="heading-setup">Setup</h2>
<p>I’m assuming your source has partition folders that follow the pattern below, which is a common date-time partitioning scheme:</p>
<p><code>year=*/month=*/day=*/hour=*</code></p>
<p>Crucially, as I mentioned above, I’m assuming that you don’t know the full folder names in advance. You only know that each one starts with a constant prefix such as <code>year=</code> or <code>month=</code>.</p>
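<p>If you don’t have such a folder tree at hand, here is a minimal sketch that creates one locally using only the Java standard library. The two partition values, the file name <code>part-0000.csv</code>, and the rows are made up for illustration. The columns match the <code>State</code>, <code>Color</code>, and <code>Count</code> schema that the solution later in this tutorial reads.</p>
<pre><code class="lang-scala">import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

object SampleSourceData {
  // Hypothetical partition values, used only to illustrate the layout.
  val partitions: Seq[String] = Seq(
    "year=2025/month=08/day=30/hour=23",
    "year=2025/month=08/day=31/hour=00"
  )

  // Header and rows for the columns State, Color, Count.
  val csv: String = "State,Color,Count\nTX,Red,20\nCA,Blue,35\n"

  /** Writes one small CSV file under each partition folder and returns the file paths. */
  def create(basePath: Path): Seq[Path] =
    partitions.map { partition =>
      val dir = Files.createDirectories(basePath.resolve(partition))
      Files.write(dir.resolve("part-0000.csv"), csv.getBytes(StandardCharsets.UTF_8))
    }

  def main(args: Array[String]): Unit =
    create(Paths.get("data/partitioned_files_source/user")).foreach(println)
}
</code></pre>
<p>Running <code>main</code> prints the paths of the two files it created under <code>data/partitioned_files_source/user</code>.</p>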
<h2 id="heading-false-starts">False Starts</h2>
<ol>
<li><p>You might think of using the <code>recursiveFileLookup</code> and <code>pathGlobFilter</code> options for both reading and writing. This doesn’t work, because these options are only available on the read API.</p>
</li>
<li><p>You might think of parameterizing the read and write over every possible year/month/day/hour combination, and skipping the export whenever the corresponding partition folder isn’t found. This can work, but it isn’t efficient, because most of the combinations you probe may not exist.</p>
</li>
</ol>
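<p>To get a feel for the cost of the second approach, here is a rough sketch that enumerates every candidate folder for a single year. The object name and the choice of year are just for illustration. Each candidate would need its own existence check before you could read it, even if only a handful of the folders actually exist.</p>
<pre><code class="lang-scala">import java.time.LocalDate

object CandidatePartitions {
  /** Builds every year=/month=/day=/hour= folder that could exist in the given year. */
  def forYear(year: Int): Seq[String] = {
    val start = LocalDate.of(year, 1, 1)
    (0 until start.lengthOfYear).flatMap { dayOffset =>
      val date = start.plusDays(dayOffset.toLong)
      (0 until 24).map { hour =>
        f"year=${date.getYear}%04d/month=${date.getMonthValue}%02d/day=${date.getDayOfMonth}%02d/hour=$hour%02d"
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val candidates = forYear(2025)
    // 365 days * 24 hours = 8760 folders to probe for one year.
    println(s"${candidates.size} candidate folders to probe")
    candidates.take(2).foreach(println)
  }
}
</code></pre>
<p>That is 8,760 probes for one year of hourly partitions, and the number grows with every extra year or partition level.</p>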
<h2 id="heading-my-solution">My Solution</h2>
<p>After some trial and error, and some searching on Stack Overflow and in the Spark documentation, I hit upon the idea of combining <code>input_file_name()</code> and <code>regexp_extract()</code> to rebuild the partition columns from each file’s path, and then using <code>partitionBy()</code> on the write side to recreate the folders. You can find sample Scala code below:</p>
<pre><code class="lang-scala"><span class="hljs-keyword">package</span> main.scala.blog

<span class="hljs-comment">/**
*  Spark batch example that reads from a partitioned folder and writes
*  to a partitioned folder without knowing the datetime values in advance.
*/</span>

<span class="hljs-keyword">import</span> org.apache.spark.sql.<span class="hljs-type">SparkSession</span>
<span class="hljs-keyword">import</span> org.apache.spark.sql.functions.{input_file_name, regexp_extract}

<span class="hljs-class"><span class="hljs-keyword">object</span> <span class="hljs-title">PartitionedReaderWriter</span> </span>{

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span></span>(args: <span class="hljs-type">Array</span>[<span class="hljs-type">String</span>]): <span class="hljs-type">Unit</span> = {
        <span class="hljs-comment">// 1. Create the Spark session.</span>
        <span class="hljs-keyword">val</span> spark = <span class="hljs-type">SparkSession</span>
                    .builder
                    .appName(<span class="hljs-string">"PartitionedReaderWriterApp"</span>)
                    .getOrCreate()

        <span class="hljs-keyword">val</span> sourceBasePath = <span class="hljs-string">"data/partitioned_files_source/user"</span>
        <span class="hljs-comment">// 2. Read every CSV file under the source path, however deeply nested.</span>
        <span class="hljs-keyword">val</span> sourceDf = spark.read
                            .format(<span class="hljs-string">"csv"</span>)
                            .schema(<span class="hljs-string">"State STRING, Color STRING, Count INT"</span>)
                            .option(<span class="hljs-string">"header"</span>, <span class="hljs-string">"true"</span>)
                            .option(<span class="hljs-string">"pathGlobFilter"</span>, <span class="hljs-string">"*.csv"</span>)
                            .option(<span class="hljs-string">"recursiveFileLookup"</span>, <span class="hljs-string">"true"</span>)
                            .load(sourceBasePath)

        <span class="hljs-keyword">val</span> destinationBasePath = <span class="hljs-string">"data/partitioned_files_destination/user"</span>
        <span class="hljs-comment">// 3. Rebuild the partition columns from each row's source file path.</span>
        <span class="hljs-keyword">val</span> writeDf = sourceDf
                        .withColumn(<span class="hljs-string">"year"</span>, regexp_extract(input_file_name(), <span class="hljs-string">"year=(\\d{4})"</span>, <span class="hljs-number">1</span>))
                        .withColumn(<span class="hljs-string">"month"</span>, regexp_extract(input_file_name(), <span class="hljs-string">"month=(\\d{2})"</span>, <span class="hljs-number">1</span>))
                        .withColumn(<span class="hljs-string">"day"</span>, regexp_extract(input_file_name(), <span class="hljs-string">"day=(\\d{2})"</span>, <span class="hljs-number">1</span>))
                        .withColumn(<span class="hljs-string">"hour"</span>, regexp_extract(input_file_name(), <span class="hljs-string">"hour=(\\d{2})"</span>, <span class="hljs-number">1</span>))

        <span class="hljs-comment">// 4. Write the data back out, partitioned by the rebuilt columns.</span>
        writeDf.write
                .format(<span class="hljs-string">"csv"</span>)
                .option(<span class="hljs-string">"header"</span>, <span class="hljs-string">"true"</span>)
                .mode(<span class="hljs-string">"overwrite"</span>)
                .partitionBy(<span class="hljs-string">"year"</span>, <span class="hljs-string">"month"</span>, <span class="hljs-string">"day"</span>, <span class="hljs-string">"hour"</span>)
                .save(destinationBasePath)

        <span class="hljs-comment">// 5. Stop the Spark session.</span>
        spark.stop()
    }
}
</code></pre>
<p>Here’s what’s going on in the above code:</p>
<ol>
<li><p>Inside the <code>main</code> method, you begin by creating a Spark session.</p>
</li>
<li><p>You read the data from <code>sourceBasePath</code> using the <code>spark.read</code> API with the format set to <code>csv</code> (providing the schema is optional). The <code>recursiveFileLookup</code> option makes Spark read through the nested folders, and <code>pathGlobFilter</code> restricts the read to files whose names end in <code>.csv</code>. Note that <code>recursiveFileLookup</code> disables Spark’s partition inference, so <code>year</code>, <code>month</code>, <code>day</code>, and <code>hour</code> don’t show up as columns on their own.</p>
</li>
<li><p>The next section contains the core logic. You use <code>input_file_name()</code> to get the full path of the file each row came from, and <code>regexp_extract()</code> to pull <code>year</code>, <code>month</code>, <code>day</code>, and <code>hour</code> out of the corresponding subfolders in that path. Each value is stored as an auxiliary column on the dataframe.</p>
</li>
<li><p>Finally, you write the dataframe using the <code>csv</code> format again and, crucially, use <code>partitionBy</code> to specify the auxiliary columns as partition columns. Then you save the dataframe to <code>destinationBasePath</code>.</p>
</li>
<li><p>After the copy is done, you stop the Spark session by calling the <code>stop()</code> API.</p>
</li>
</ol>
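<p>To make step 3 more concrete, here is a plain-Scala sketch of what the four <code>regexp_extract()</code> calls compute for a single file path. The sample path and the helper name <code>extractPartition</code> are hypothetical. The sketch uses <code>scala.util.matching.Regex</code> rather than Spark, and it returns an empty string when a pattern isn’t found, which is also what <code>regexp_extract()</code> does.</p>
<pre><code class="lang-scala">import scala.util.matching.Regex

object PartitionExtractor {
  // Same patterns as in the Spark job above, keyed by the column they fill.
  private val patterns: Seq[(String, Regex)] = Seq(
    "year"  -> "year=(\\d{4})".r,
    "month" -> "month=(\\d{2})".r,
    "day"   -> "day=(\\d{2})".r,
    "hour"  -> "hour=(\\d{2})".r
  )

  /** Returns the first capture group per column, or "" when the pattern is absent. */
  def extractPartition(filePath: String): Map[String, String] =
    patterns.map { case (column, regex) =>
      column -> regex.findFirstMatchIn(filePath).map(_.group(1)).getOrElse("")
    }.toMap

  def main(args: Array[String]): Unit = {
    // Hypothetical value of input_file_name() for one source file.
    val path =
      "file:///data/partitioned_files_source/user/year=2025/month=08/day=31/hour=21/part-0000.csv"
    println(extractPartition(path))
  }
}
</code></pre>
<p>For the sample path, the extracted values are <code>2025</code>, <code>08</code>, <code>31</code>, and <code>21</code>. Because they are kept as strings, the leading zeros survive into the destination folder names.</p>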
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this article, I’ve shown you how to export or copy deeply nested data files from a source to a destination using Apache Spark in an efficient way. I hope you find it useful!</p>
<p>You can read my other articles at <a target="_blank" href="https://www.beyonddream.me/post-1/">https://www.beyonddream.me</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
