<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ academia - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ academia - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sat, 23 May 2026 22:21:07 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/academia/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ What is a CV? CV vs Résumé + Curriculum Vitae Meaning ]]>
                </title>
                <description>
                    <![CDATA[ Depending on where you live and the field you're in, you've probably heard the terms "résumé" and "curriculum vitae" or "CV". And you might be wondering – are they the same thing? Are these terms interchangeable? Well, the answer isn't a simple yes o... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/what-is-a-cv-and-how-is-it-different-from-a-resume/</link>
                <guid isPermaLink="false">66b1fa8709c44225ad2c3915</guid>
                
                    <category>
                        <![CDATA[ academia ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Job Hunting ]]>
                    </category>
                
                    <category>
                        <![CDATA[ jobs ]]>
                    </category>
                
                    <category>
                        <![CDATA[ resume ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Abigail Rennemeyer ]]>
                </dc:creator>
                <pubDate>Mon, 19 Apr 2021 05:41:05 +0000</pubDate>
                <media:content url="https://cdn-media-2.freecodecamp.org/w1280/606e1294d5756f080ba961c8.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Depending on where you live and the field you're in, you've probably heard the terms "résumé" and "curriculum vitae" or "CV". And you might be wondering – are they the same thing? Are these terms interchangeable?</p>
<p>Well, the answer isn't a simple yes or no. Turns out, it basically depends on whether you're in academia or not, and possibly where you live. But more on that below.</p>
<p>If you're job hunting, or just want to keep your credentials up to date, you'll want to make sure you have a résumé or a CV on hand.</p>
<p>Let's look at each document in detail. In this article you'll learn what a CV is, how it differs from a résumé (and when that distinction matters), and when you might need each one.</p>
<h2 id="heading-what-is-a-cv">What is a CV?</h2>
<p>A CV, or curriculum vitae, actually has two meanings, depending on the field you're in.</p>
<p>But first, what does the Latin "curriculum vitae" actually mean? Well, it means "the course of (one's) life". Which makes it sound like quite an epic document, depending on how much life experience you've had.</p>
<h3 id="heading-cvs-in-academia">CVs in Academia</h3>
<p>If you're in academia and/or are applying to an academic position, this makes sense. A CV in this case refers to a detailed document that explains your educational and professional background, any publications you have, research you've done and so on – in great depth. </p>
<p>You'd also use this type of CV if you're applying for large grants or fellowships, for certain jobs in medical and scientific fields, and so on.</p>
<h3 id="heading-cvs-in-industry-jobs">CVs in Industry Jobs</h3>
<p>On the other hand, in both British and American English, the term CV can be used to reference a short document that catalogues your education, career history, and skills. It's usually no more than a page (front and back at the most) and provides the most important highlights you want your potential employer to know.</p>
<p>Basically, in this case, a CV is what you'd send to a company for whom you want to work as a data scientist, programmer, business development lead, and other jobs like those ("industry" jobs). It would be the first thing the employer likely sees when considering your application, and they'd probably spend about 6 seconds reviewing it.</p>
<p>So, just to summarize:</p>
<ul>
<li>In academia, a CV refers to an in-depth personal and professional life summary that includes education, career history, publications, and other professional achievements and awards.</li>
<li>In other industries – like tech or business – the term CV refers to the short education, career, and skills summary you submit with job applications.</li>
</ul>
<h2 id="heading-cv-vs-resume-what-are-the-main-differences">CV vs Résumé – What Are the Main Differences?</h2>
<p>The shorter CV might sound familiar – and that's because it's basically interchangeable with a résumé. In the United States and elsewhere, you can use both terms (CV and résumé) to refer to the shorter document you submit with job applications.</p>
<p>So what are the main differences between academic CVs and traditional résumés? Let's take a look at the primary components of each so we can better distinguish between the two documents.</p>
<h3 id="heading-what-to-include-in-an-academic-cv">What to include in an academic CV</h3>
<p>As we learned above, a CV intended for the academic world includes more detail and generally more information than a résumé. Generally, you'll want to have sections for:</p>
<ul>
<li>Your professional qualifications – any certifications you might have</li>
<li>Your educational background – your degree(s), any theses you've written, other courses you've taken</li>
<li>Your work experience – jobs you've had, projects you've worked on, internships you've held, teaching positions you've had, research you've conducted</li>
<li>Your accomplishments – any awards or honors you've received, fellowships or grants you've been awarded, books or papers you've written</li>
<li>Your activities – you can include things like volunteer work, serious hobbies, side projects</li>
<li>Any special qualifications you might have</li>
</ul>
<h3 id="heading-what-to-include-in-a-resumeshorter-cv">What to include in a résumé/shorter CV</h3>
<p>You might have heard that recruiters or employers might spend no more than 6 seconds reviewing your résumé – and while that's not always true, you have to imagine it might be.</p>
<p>So your résumé needs to be focused and to the point, and should only highlight your most recent experience and achievements, and your strongest skills. Here's what to include:</p>
<ul>
<li>Your name and contact information – make sure you include an email address, and you can also add your social media handles if you want.</li>
<li>You can include a summary – a couple of sentences that give an overview of your professional experience thus far (a brief "getting to know you" paragraph).</li>
<li>Your educational background – where you got your degree (if you have one) and any post-grad work. If you didn't go to college, you can list any bootcamps or online courses you've taken.</li>
<li>Your work experience – if you've had a number of jobs and have a fair amount of experience, just include the most recent and relevant. If you're new to the job market, include any projects, internships, or other relevant experience.</li>
<li>Your top skills – if you're applying for a job that requires specific skills, and you have those skills, list them. You can also list general skills that would apply to that position.</li>
</ul>
<p>This is the primary info you want to include. Your résumé shouldn't be much longer than a page (maybe two if you've had a lot of experience/jobs), but if you have more room you can include honors and awards and side projects.</p>
<p>So in short, academic CVs are much more in-depth, cover more ground, and provide a more complete picture of your entire professional history.</p>
<p>Shorter CVs/résumés, on the other hand, focus on your relevant education and work experience, and the skills you have that are applicable to the job for which you're applying.</p>
<h2 id="heading-example-of-a-cv">Example of a CV</h2>
<p>Here's an example of a pretty impressive CV. I'll include a screenshot of the first <em>page</em> here, but it's 10 pages long. </p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/04/cv-example.png" alt="Image" width="600" height="400" loading="lazy">
<em>Thank you to Dr. Tuba Yilmaz Abdolsaheb for <a target="_blank" href="http://tubayilmaz.com/">sharing this example</a>!</em></p>
<h2 id="heading-example-of-a-resume">Example of a Résumé</h2>
<p>And here's an example of a shorter CV/résumé, like what you'd take to an industry job interview. This example is for a data scientist, and the entire thing is one page long.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/04/resume-example.png" alt="Image" width="600" height="400" loading="lazy">
<em>Thanks to <a target="_blank" href="https://www.indeed.com/career-advice/resume-samples/information-technology-resumes/data-scientist">Indeed</a> for the example.</em></p>
<p>And that's it!</p>
<p>Hopefully now you know the differences between an academic CV and a shorter CV or résumé, and will know which one to choose when you're applying for jobs.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Comparing Brazilian and US university theses using natural language processing ]]>
                </title>
                <description>
                    <![CDATA[ By Déborah Mesquita People are more likely to consider a thesis that’s written by a student at a top-ranked University as better than a thesis produced by a student at a University with low (or no) status. But in what way are the works different? Wha... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/comparing-brazilian-and-us-university-theses-using-natural-language-processing-47196a2f9d64/</link>
                <guid isPermaLink="false">66c347aba1d481faeda49b18</guid>
                
                    <category>
                        <![CDATA[ academia ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ technology ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Mon, 22 May 2017 19:25:45 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*D4_EAQTuToB_u4nRFRQ_9A.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Déborah Mesquita</p>
<p>People are more likely to consider a thesis that’s written by a student at a top-ranked University as better than a thesis produced by a student at a University with low (or no) status.</p>
<p>But in what way are the works different? What can the students from non-famous Universities do to produce better work and become more well-known?</p>
<p>I was curious to answer these questions, so I decided to explore <strong>two things</strong> <strong>only</strong>: the themes of the works and their nature. Measuring the quality of a university is something very complex, and is not my goal here. We will analyze a number of Undergraduate theses using natural language processing. We’ll extract keywords using tf-idf and classify the theses using Latent Semantic Indexing (LSI).</p>
<h3 id="heading-the-data">The data</h3>
<p>Our dataset has abstracts of Undergraduate Computer Science Theses from <a target="_blank" href="https://en.wikipedia.org/wiki/Federal_University_of_Pernambuco">Federal University of Pernambuco</a> (UFPE), located in Brazil, and from <a target="_blank" href="https://en.wikipedia.org/wiki/Carnegie_Mellon_University">Carnegie Mellon University</a>, located in the United States. Why Carnegie Mellon? Because it was the only University where I could find a list of theses produced by students who were at the end of their Undergraduate degree program.</p>
<p>The <a target="_blank" href="https://www.timeshighereducation.com/world-university-rankings/2017/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats">Times Higher Education World University Rankings</a> say that Carnegie Mellon has the 6th best Computer Science program, while UFPE is not even in this ranking. Carnegie Mellon ranks 23rd in the World University Ranking, and UFPE is around 801st.</p>
<p>All works were produced between the years of 2002 and 2016. Each thesis has the following information:</p>
<ul>
<li>title of the thesis</li>
<li>abstract of the thesis</li>
<li>year of the thesis</li>
<li>university where the thesis was produced</li>
</ul>
<p>Theses from Carnegie Mellon can be found <a target="_blank" href="https://www.csd.cs.cmu.edu/education/bscs/thesis-topics.html">here</a> and theses from Federal University of Pernambuco can be found <a target="_blank" href="http://cin.ufpe.br/~tg/">here</a>.</p>
<h3 id="heading-step-1-investigating-the-themes-of-the-theses">Step 1 — Investigating the themes of the theses</h3>
<h4 id="heading-extracting-keywords">Extracting keywords</h4>
<p>To get the themes of the theses, we will use a well-known algorithm called tf-idf.</p>
<h4 id="heading-tf-idf">tf-idf</h4>
<p>What tf-idf does is to penalize words that <strong>appear a lot</strong> in a document and at the same time <strong>appear a lot in other documents</strong>. If this happens, the word is not a good pick to characterize this text (as the word could also be used to characterize <em>all</em> the texts). Let’s use <a target="_blank" href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">an example</a> to understand this better. We have two documents:</p>
<p>Document 1:</p>
<pre><code>| Term   | Term Count |
|--------|------------|
| this   |     1      |
| is     |     1      |
| a      |     2      |
| sample |     1      |
</code></pre><p>And Document 2:</p>
<pre><code>| Term    | Term Count |
|---------|------------|
| this    |     1      |
| is      |     1      |
| another |     2      |
| example |     3      |
</code></pre><p>First let’s see what’s going on. The word <em>this</em> appears 1 time in both documents. This could mean that the word is kind of neutral, right?</p>
<p>On the other hand, the word <em>example</em> appears 3 times in Document 2 and 0 times in Document 1. Interesting.</p>
<p>Now let’s apply some math. We need to compute two things: TF (Term Frequency) and IDF (Inverse Document Frequency).</p>
<p>The equation for TF is:</p>
<pre><code>TF(t) = (<span class="hljs-built_in">Number</span> <span class="hljs-keyword">of</span> times that term t appears <span class="hljs-keyword">in</span> the <span class="hljs-built_in">document</span>) / (Total number <span class="hljs-keyword">of</span> terms <span class="hljs-keyword">in</span> the <span class="hljs-built_in">document</span>)
</code></pre><p>So for terms <em>this</em> and <em>example</em>, we have:</p>
<pre><code>TF('this',    Document 1) = 1/5 = 0.2
TF('example', Document 1) = 0/5 = 0
</code></pre><pre><code>TF('this',    Document 2) = 1/7 = 0.14
TF('example', Document 2) = 3/7 = 0.43
</code></pre><p>The equation for IDF is:</p>
<pre><code>IDF(t) = log_e(Total number <span class="hljs-keyword">of</span> documents / <span class="hljs-built_in">Number</span> <span class="hljs-keyword">of</span> documents where term t is present)
</code></pre><p>Why do we use a logarithm here? Because tf-idf is a <a target="_blank" href="https://en.wikipedia.org/wiki/Heuristic">heuristic</a>.</p>
<blockquote>
<p>The intuition was that a query term which occurs in many documents is not a good discriminator, and should be given less weight than one which occurs in few documents, and the measure was an heuristic implementation of this intuition. — <a target="_blank" href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.7340&amp;rep=rep1&amp;type=pdf">Stephen Robertson</a></p>
</blockquote>
<p>As <a target="_blank" href="https://stats.stackexchange.com/users/11852/us%ce%b5r11852">usεr11852</a> explains <a target="_blank" href="https://stats.stackexchange.com/questions/161640/understanding-the-use-of-logarithms-in-the-tf-idf-logarithm">here</a>:</p>
<blockquote>
<p>The aspect emphasised is that the relevance of a term or a document does not increase proportionally with term (or document) frequency. Using a sub-linear function (the logarithm) therefore helps dumped down (sic) this effect. …The influence of very large or very small values (e.g. very rare words) is also amortised. — <a target="_blank" href="https://stats.stackexchange.com/questions/161640/understanding-the-use-of-logarithms-in-the-tf-idf-logarithm">usεr11852</a></p>
</blockquote>
<p>Using the equation for IDF with a base-10 logarithm (the choice of base only rescales the scores), we have:</p>
<pre><code>IDF('this',    Documents) = log10(2/2) = 0
</code></pre><pre><code>IDF('example', Documents) = log10(2/1) = 0.30
</code></pre><p>And finally, the TF-IDF:</p>
<pre><code>TF-IDF('this',    Document 2) = 0.14 x 0 = 0
TF-IDF('example', Document 2) = 0.43 x 0.30 = 0.13
</code></pre><p>For each thesis, I took the 4 words with the highest tf-idf scores. I did this using CountVectorizer and TfidfTransformer from <a target="_blank" href="http://scikit-learn.org/stable/">scikit-learn</a>.</p>
<p>You can see the <strong>Jupyter notebook with the code</strong> <a target="_blank" href="https://github.com/dmesquita/tdcfloripa2017/blob/master/extract_keywords.ipynb">here</a>.</p>
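<p>Separately from the notebook, the hand calculation for Document 1 and Document 2 earlier in this section can be double-checked in plain Python. This is only a sketch: the function names are illustrative, and it uses a base-10 logarithm because that is what the 0.30 value in the worked example implies.</p>

```python
import math

# Term counts for the two toy documents from the worked example.
doc1 = {"this": 1, "is": 1, "a": 2, "sample": 1}
doc2 = {"this": 1, "is": 1, "another": 2, "example": 3}
documents = [doc1, doc2]


def tf(term, doc):
    """Term frequency: count of `term` divided by the total number of terms."""
    return doc.get(term, 0) / sum(doc.values())


def idf(term, docs):
    """Inverse document frequency (base 10).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log10(len(docs) / containing)


def tf_idf(term, doc, docs):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, docs)


print(round(tf_idf("this", doc2, documents), 2))     # 0.0
print(round(tf_idf("example", doc2, documents), 2))  # 0.13
```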
<p>With 4 keywords for each thesis, I used the <a target="_blank" href="https://github.com/amueller/word_cloud">WordCloud</a> library to visualize the words for each University.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/rQQbAtpEZYsk2AQ12rPxbIfIVtPf6hDoVq25" alt="Image" width="738" height="578" loading="lazy">
<em>Keywords for UFPE</em></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/YFw7RGJdQXHiJMvC6Ya0HRYnZnvn5VUlmRHV" alt="Image" width="738" height="578" loading="lazy">
<em>Keywords for Carnegie Mellon</em></p>
<h3 id="heading-topic-modeling">Topic Modeling</h3>
<p>Another strategy I used to explore the themes from theses of both Universities was topic modeling with <a target="_blank" href="https://en.wikipedia.org/wiki/Latent_semantic_analysis">Latent Semantic Indexing</a> (LSI).</p>
<h4 id="heading-latent-semantic-indexing">Latent Semantic Indexing</h4>
<p>This algorithm gets data from tf-idf and uses matrix decomposition to group documents in topics. We will need some linear algebra to understand this, so let’s start.</p>
<h4 id="heading-singular-value-decomposition-svd">Singular Value Decomposition (SVD)</h4>
<p>First we need to define how to do this matrix decomposition. We will use <a target="_blank" href="https://en.wikipedia.org/wiki/Singular_value_decomposition">Singular Value Decomposition</a> (SVD). Given a matrix <em>M</em> of dimensions <em>m x n</em>, <em>M</em> can be described as:</p>
<pre><code>M = UDV*
</code></pre><p>Where the columns of <em>U</em> and <em>V</em> each form an <a target="_blank" href="https://en.wikipedia.org/wiki/Orthonormal_basis">orthonormal basis</a> (<em>V*</em> represents the transpose of matrix <em>V</em>). A basis is orthonormal if we have two things (normal + orthogonal):</p>
<ul>
<li>when all vectors are of length 1</li>
<li>when all vectors are mutually orthogonal (they make an angle of 90°)</li>
</ul>
<p><em>D</em> is a diagonal matrix (the entries outside the main diagonal are all zero).</p>
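<p>These properties are easy to check numerically. The sketch below uses NumPy's <code>numpy.linalg.svd</code> on an arbitrary matrix chosen only for illustration. It is not taken from the thesis analysis.</p>

```python
import numpy as np

# An arbitrary 3 x 2 example matrix, chosen only for illustration.
M = np.array([[3.0, 1.0],
              [1.0, 2.0],
              [0.0, 1.0]])

# full_matrices=False returns the compact form: U is 3 x 2, Vt is 2 x 2.
U, singular_values, Vt = np.linalg.svd(M, full_matrices=False)
D = np.diag(singular_values)

# Columns of U and rows of Vt have length 1 and are mutually orthogonal.
print(np.allclose(U.T @ U, np.eye(2)))    # True
print(np.allclose(Vt @ Vt.T, np.eye(2)))  # True

# Multiplying the three factors back together recovers M.
print(np.allclose(U @ D @ Vt, M))         # True
```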
<p>To get a sense of how all of this works together we will use the brilliant geometric explanation from <a target="_blank" href="http://www.ams.org/samplings/feature-column/fcarc-svd">this article</a> by David Austin.</p>
<p>Let’s say we have a matrix <em>M</em>:</p>
<pre><code>M = | 3 0 |
    | 0 1 |
</code></pre><p>We can take a point (<em>x</em>, <em>y</em>) in the plane and transform it into another point using matrix multiplication:</p>
<pre><code>| 3 0 | . | x | = | 3x |
| 0 1 |   | y |   | y  |
</code></pre><p>The effect of this transformation is shown below:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/lm4mjFlEkmxbRiGyVkGvNu6ImJkWl-wzuBVA" alt="Image" width="252" height="252" loading="lazy">
<em>x,y before. Source: <a target="_blank" href="http://www.ams.org/samplings/feature-column/fcarc-svd">http://www.ams.org/samplings/feature-column/fcarc-svd</a></em></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/9MmrSVNihenregEuCPu3r56uWv1iOz3xv4zE" alt="Image" width="252" height="252" loading="lazy">
<em>x,y after. Source: <a target="_blank" href="http://www.ams.org/samplings/feature-column/fcarc-svd">http://www.ams.org/samplings/feature-column/fcarc-svd</a></em></p>
<p>As we can see, the plane is horizontally stretched by a factor of 3, while there is no vertical change.</p>
<p>Now, if we take another matrix, <em>M’:</em></p>
<pre><code>M' = | 2 1 |
     | 1 2 |
</code></pre><p>The effect is:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/F-HDjcaeJ-MUM41sRUCxMjmPIMUn611YtNhw" alt="Image" width="252" height="252" loading="lazy">
<em>x,y before. Source: <a target="_blank" href="http://www.ams.org/samplings/feature-column/fcarc-svd">http://www.ams.org/samplings/feature-column/fcarc-svd</a></em></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/tD2f46D10GMQtR9zHl70C8agQ3X6pJbaRcuf" alt="Image" width="252" height="252" loading="lazy">
<em>x,y after. Source: <a target="_blank" href="http://www.ams.org/samplings/feature-column/fcarc-svd">http://www.ams.org/samplings/feature-column/fcarc-svd</a></em></p>
<p>It is not so clear how to simply describe the geometric effect of the transformation. However, let’s rotate our grid through a 45 degree angle and see what happens.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/H2vS1tWqmjZM104LEw-2Xo5RwPWvgRX6f0sT" alt="Image" width="252" height="252" loading="lazy">
<em>x,y before, in a grid through a 45 degree angle. Source: <a target="_blank" href="http://www.ams.org/samplings/feature-column/fcarc-svd">http://www.ams.org/samplings/feature-column/fcarc-svd</a></em></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/mck7i9bXbMhjH6up91o0GUAbFUSAl0KGuFLm" alt="Image" width="252" height="252" loading="lazy">
<em>x,y after, in a grid through a 45 degree angle. Source: <a target="_blank" href="http://www.ams.org/samplings/feature-column/fcarc-svd">http://www.ams.org/samplings/feature-column/fcarc-svd</a></em></p>
<p>We see now that this new grid is transformed in the same way that the original grid was transformed by the diagonal matrix: <strong>the grid is stretched by a factor of 3 in one direction</strong>.</p>
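<p>We can confirm the stretch with a few lines of NumPy. In this sketch, <code>v1</code> and <code>v2</code> are the unit vectors of the grid rotated by 45 degrees. The variable names are chosen for illustration.</p>

```python
import numpy as np

M_prime = np.array([[2.0, 1.0],
                    [1.0, 2.0]])

# Unit vectors of a grid rotated by 45 degrees.
v1 = np.array([1.0, 1.0]) / np.sqrt(2)
v2 = np.array([-1.0, 1.0]) / np.sqrt(2)

# M' stretches v1 by a factor of 3 and leaves v2 unchanged.
print(np.allclose(M_prime @ v1, 3 * v1))  # True
print(np.allclose(M_prime @ v2, v2))      # True
```

<p>So in the rotated grid, <em>M’</em> acts like a diagonal matrix with entries 3 and 1.</p>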
<p>Now let’s use some definitions. <em>M</em> is a <strong>diagonal matrix</strong> (the entries outside the main diagonal are all zero) and both <em>M</em> and <em>M’</em> are <a target="_blank" href="https://en.wikipedia.org/wiki/Symmetric_matrix"><strong>symmetric</strong></a> (if we get the columns and use them as new rows, we will get the same matrix).</p>
<p>Multiplying by a <strong>diagonal matrix</strong> results in a <a target="_blank" href="https://en.wikipedia.org/wiki/Scaling_(geometry)">scaling</a> effect (a linear transformation that enlarges or shrinks objects by a scale factor).</p>
<blockquote>
<p>The effect we saw (the same result for both <em>M</em> and <em>M’</em>) is a very special situation that results from the fact that the matrix <em>M’</em> is symmetric. If we have a symmetric 2 x 2 matrix, it turns out that we may always rotate the grid in the domain so that the matrix acts by stretching and perhaps reflecting in the two directions. In other words, symmetric matrices behave like diagonal matrices. — <a target="_blank" href="http://www.ams.org/samplings/feature-column/fcarc-svd">David Austin</a></p>
<p>“This is the geometric essence of the singular value decomposition for 2 x 2 matrices: for any 2 x 2 matrix, we may find an orthogonal grid that is transformed into another orthogonal grid.” — <a target="_blank" href="http://www.ams.org/samplings/feature-column/fcarc-svd">David Austin</a></p>
</blockquote>
<p>We will express this fact using vectors: with an appropriate choice of orthogonal unit vectors <em>v1</em> and <em>v2</em>, the vectors <em>Mv1</em> and <em>Mv2</em> are orthogonal.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/Ba9G7FFiNLv7Ukb4-f0cuanwn3Il1TxxOHf9" alt="Image" width="252" height="252" loading="lazy">
<em>v1 and v2 in the original grid. Source: <a target="_blank" href="http://www.ams.org/samplings/feature-column/fcarc-svd">http://www.ams.org/samplings/feature-column/fcarc-svd</a></em></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/pwPiTYpUd7g6bZGdtl3QaY5Znq3Zu-8Uh2uB" alt="Image" width="252" height="252" loading="lazy">
<em>Mv1 and Mv2 in the new grid. Source: <a target="_blank" href="http://www.ams.org/samplings/feature-column/fcarc-svd">http://www.ams.org/samplings/feature-column/fcarc-svd</a></em></p>
<p>We will use <em>n1</em> and <em>n2</em> to denote unit vectors in the direction of <em>Mv1</em> and <em>Mv2</em>. The lengths of <em>Mv1</em> and <em>Mv2</em> — denoted by σ1 and σ2 — describe the amount that the grid is stretched in those particular directions.</p>
<p>Now that we have a geometric essence, let’s go back to the formula:</p>
<pre><code>M = UDV*
</code></pre><ul>
<li><em>U</em> is a matrix whose columns are the vectors <em>n1</em> and <em>n2</em> (<strong>unit vectors of the ‘new’ grid,</strong> in the direction of <em>Mv1</em> and <em>Mv2</em>)</li>
<li><em>D</em> is a diagonal matrix whose entries are σ1 and σ2 (<strong>the lengths of <em>Mv1</em> and <em>Mv2</em></strong>)</li>
<li><em>V</em> is a matrix whose columns are <em>v1</em> and <em>v2</em> (<strong>vectors of the ‘old’ grid</strong>), so <em>V*</em>, its transpose, has them as rows</li>
</ul>
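<p>To connect the formula with the geometry, here is a small sketch that decomposes the symmetric matrix with rows (2, 1) and (1, 2) from the example above using <code>numpy.linalg.svd</code>. It checks that each ‘old’ grid vector is mapped onto a stretched ‘new’ grid vector. The loop and variable names are illustrative.</p>

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

U, sigma, Vt = np.linalg.svd(M)

# The singular values are the stretch factors of the grid.
print(np.round(sigma, 6))  # approximately [3. 1.]

# Rows of Vt are v1 and v2; columns of U are n1 and n2.
for i in range(2):
    v_i = Vt[i]
    n_i = U[:, i]
    # M maps each old-grid vector onto a stretched new-grid vector.
    print(np.allclose(M @ v_i, sigma[i] * n_i))  # True
```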
<p>Now that we understand a little about how SVD works, let’s see how LSI makes use of the technique to group texts. As <a target="_blank" href="https://scholar.google.com/citations?user=TcFyZgcAAAAJ">Ian Soboroff</a> shows in his Information Retrieval course <a target="_blank" href="https://www.csee.umbc.edu/~ian/irF02/lectures/12LSI.pdf">slides</a>:</p>
<ul>
<li><em>U</em> is a matrix for <strong>transforming new documents</strong></li>
<li><em>D</em> is the diagonal matrix that gives <strong>relative importance of dimensions</strong> (we will talk more about these dimensions in a minute)</li>
<li><em>V*</em> is a <strong>representation of <em>M</em> in <em>k</em> dimensions</strong></li>
</ul>
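<p>The three roles above can be sketched with a truncated SVD in NumPy. Everything concrete here is an assumption made for illustration: the small count matrix, the choice of <em>k</em> = 2, and the fold-in step (multiplying a new document by the transpose of the truncated <em>U</em> and the inverse of the truncated <em>D</em>), which is the usual LSI recipe rather than something taken from the slides.</p>

```python
import numpy as np

# Hypothetical term-document counts: rows are terms, columns are documents.
counts = np.array([[1.0, 1.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 1.0],
                   [0.0, 0.0, 1.0, 1.0],
                   [0.0, 0.0, 1.0, 0.0]])

k = 2  # number of dimensions (topics) to keep, chosen for illustration

U, sigma, Vt = np.linalg.svd(counts, full_matrices=False)
U_k, D_k, Vt_k = U[:, :k], np.diag(sigma[:k]), Vt[:k, :]

# Each column of Vt_k describes one document with k numbers.
document_coords = Vt_k.T

# Fold a new document (a vector of term counts) into the same k-dim space.
new_document = np.array([0.0, 1.0, 1.0, 0.0, 0.0])
new_coords = np.linalg.inv(D_k) @ U_k.T @ new_document

print(document_coords.shape)  # (4, 2)
print(new_coords.shape)       # (2,)
```

<p>Folding in a document that is already a column of the matrix gives back its row in <code>document_coords</code>, which is a handy sanity check.</p>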
<p>To see how this works we will use document titles from two domains (Human Computer Interaction and Graph Theory). These examples are from the paper <a target="_blank" href="http://lsa.colorado.edu/papers/dp1.LSAintro.pdf">An Introduction to Latent Semantic Analysis</a>.</p>
<pre><code>c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: System and human system engineering testing of EPS
</code></pre><pre><code>m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors: A survey
</code></pre><p>The first step is to create a matrix with the number of times each term appears:</p>
<pre><code>| term      | c1 | c2 | c3 | m1 | m2 | m3 |
|-----------|----|----|----|----|----|----|
| human     | 1  | 0  | 1  | 0  | 0  | 0  |
| interface | 1  | 0  | 0  | 0  | 0  | 0  |
| computer  | 1  | 1  | 0  | 0  | 0  | 0  |
| user      | 0  | 1  | 0  | 0  | 0  | 0  |
| system    | 0  | 1  | 2  | 0  | 0  | 0  |
| survey    | 0  | 1  | 0  | 0  | 0  | 1  |
| trees     | 0  | 0  | 0  | 1  | 1  | 0  |
| graph     | 0  | 0  | 0  | 0  | 1  | 1  |
| minors    | 0  | 0  | 0  | 0  | 0  | 1  |
</code></pre><p>Decomposing the matrix we have this (you can use this <a target="_blank" href="http://www.bluebit.gr/matrix-calculator/default.aspx">online tool</a> to apply the SVD):</p>
<pre><code># U Matrix (to transform <span class="hljs-keyword">new</span> documents)
</code></pre><pre><code><span class="hljs-number">-0.386</span>  <span class="hljs-number">0.222</span> <span class="hljs-number">-0.096</span> <span class="hljs-number">-0.458</span>  <span class="hljs-number">0.357</span> <span class="hljs-number">-0.105</span>
<span class="hljs-number">-0.119</span>  <span class="hljs-number">0.055</span> <span class="hljs-number">-0.434</span> <span class="hljs-number">-0.379</span>  <span class="hljs-number">0.156</span> <span class="hljs-number">-0.040</span>
<span class="hljs-number">-0.345</span> <span class="hljs-number">-0.062</span> <span class="hljs-number">-0.615</span> <span class="hljs-number">-0.089</span> <span class="hljs-number">-0.264</span>  <span class="hljs-number">0.135</span>
<span class="hljs-number">-0.226</span> <span class="hljs-number">-0.117</span> <span class="hljs-number">-0.181</span>  <span class="hljs-number">0.290</span> <span class="hljs-number">-0.420</span>  <span class="hljs-number">0.175</span>
<span class="hljs-number">-0.760</span>  <span class="hljs-number">0.218</span>  <span class="hljs-number">0.493</span>  <span class="hljs-number">0.133</span> <span class="hljs-number">-0.018</span>  <span class="hljs-number">0.044</span>
<span class="hljs-number">-0.284</span> <span class="hljs-number">-0.498</span> <span class="hljs-number">-0.176</span>  <span class="hljs-number">0.374</span>  <span class="hljs-number">0.033</span> <span class="hljs-number">-0.311</span>
<span class="hljs-number">-0.013</span> <span class="hljs-number">-0.321</span>  <span class="hljs-number">0.289</span> <span class="hljs-number">-0.571</span> <span class="hljs-number">-0.582</span> <span class="hljs-number">-0.386</span>
<span class="hljs-number">-0.069</span> <span class="hljs-number">-0.621</span>  <span class="hljs-number">0.185</span> <span class="hljs-number">-0.252</span>  <span class="hljs-number">0.236</span>  <span class="hljs-number">0.675</span>
<span class="hljs-number">-0.057</span> <span class="hljs-number">-0.382</span>  <span class="hljs-number">0.005</span>  <span class="hljs-number">0.085</span>  <span class="hljs-number">0.453</span> <span class="hljs-number">-0.485</span>
</code></pre><p>Matrix that gives relative importance of dimensions:</p>
<pre><code># D Matrix (relative importance <span class="hljs-keyword">of</span> dimensions)
</code></pre><pre><code><span class="hljs-number">2.672</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span>
<span class="hljs-number">0.000</span> <span class="hljs-number">1.983</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span>
<span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">1.625</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span>
<span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">1.563</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span>
<span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">1.263</span> <span class="hljs-number">0.000</span>
<span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.499</span>
</code></pre><p>Representation of <em>M</em> in <em>k</em> dimensions (in this case, we have <em>k</em> documents):</p>
<pre><code># V* Matrix (representation <span class="hljs-keyword">of</span> M <span class="hljs-keyword">in</span> k dimensions)
</code></pre><pre><code><span class="hljs-number">-0.318</span> <span class="hljs-number">-0.604</span> <span class="hljs-number">-0.713</span> <span class="hljs-number">-0.005</span> <span class="hljs-number">-0.031</span> <span class="hljs-number">-0.153</span>
 <span class="hljs-number">0.108</span> <span class="hljs-number">-0.231</span>  <span class="hljs-number">0.332</span> <span class="hljs-number">-0.162</span> <span class="hljs-number">-0.475</span> <span class="hljs-number">-0.757</span>
<span class="hljs-number">-0.705</span> <span class="hljs-number">-0.294</span>  <span class="hljs-number">0.548</span>  <span class="hljs-number">0.178</span>  <span class="hljs-number">0.291</span>  <span class="hljs-number">0.009</span>
<span class="hljs-number">-0.593</span>  <span class="hljs-number">0.453</span> <span class="hljs-number">-0.122</span> <span class="hljs-number">-0.365</span> <span class="hljs-number">-0.527</span>  <span class="hljs-number">0.132</span>
 <span class="hljs-number">0.197</span> <span class="hljs-number">-0.531</span>  <span class="hljs-number">0.254</span> <span class="hljs-number">-0.461</span> <span class="hljs-number">-0.274</span>  <span class="hljs-number">0.572</span>
<span class="hljs-number">-0.020</span>  <span class="hljs-number">0.087</span> <span class="hljs-number">-0.033</span> <span class="hljs-number">-0.772</span>  <span class="hljs-number">0.580</span> <span class="hljs-number">-0.242</span>
</code></pre><p>Okay, we have the matrices. But now the matrix is not 2 x 2. Do we really need as many dimensions as this term-document matrix has? Are all of the dimensions important features for each term and each document?</p>
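<p>For readers who would rather not rely on the online tool, the decomposition above can be sketched with NumPy. This is a minimal illustration, not the code from the original notebook: the names <code>terms</code>, <code>docs</code>, and <code>M</code> are illustrative labels, and the signs of individual singular vectors may come out flipped relative to the matrices printed above, because an SVD is only unique up to sign.</p>

```python
import numpy as np

# Illustrative labels for the rows and columns of the table above.
terms = ["human", "interface", "computer", "user", "system",
         "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "m1", "m2", "m3"]

# Term-document count matrix (rows: terms, columns: documents).
M = np.array([
    [1, 0, 1, 0, 0, 0],  # human
    [1, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0],  # computer
    [0, 1, 0, 0, 0, 0],  # user
    [0, 1, 2, 0, 0, 0],  # system
    [0, 1, 0, 0, 0, 1],  # survey
    [0, 0, 0, 1, 1, 0],  # trees
    [0, 0, 0, 0, 1, 1],  # graph
    [0, 0, 0, 0, 0, 1],  # minors
], dtype=float)

# full_matrices=False gives the "thin" SVD: U is 9x6, s has 6 values, Vt is 6x6.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# s holds the diagonal of D, sorted from largest to smallest.
print(np.round(s, 3))

# Multiplying the three factors back together recovers M (up to rounding error).
reconstructed = U @ np.diag(s) @ Vt
print(np.allclose(reconstructed, M))
```

<p>Here <code>s</code> plays the role of the diagonal of <em>D</em>, and <code>Vt</code> plays the role of <em>V*</em>.</p>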
<p>Let’s go back to the example of David Austin. Let’s say now we have <em>M’’</em>:</p>
<pre><code>M<span class="hljs-string">''</span> = | <span class="hljs-number">1</span> <span class="hljs-number">1</span> |
      | <span class="hljs-number">2</span> <span class="hljs-number">2</span> |
</code></pre><p><img src="https://cdn-media-1.freecodecamp.org/images/W8PjHGo4AFtHcw1l3BafJSFftpNXKxX-40Ji" alt="Image" width="252" height="252" loading="lazy">
<em>x,y before</em></p>
<p>Now <em>M’’</em> <strong>is no longer a symmetric matrix</strong>. For this matrix, the value of σ2 is zero. On the grid, the result of the multiplication is:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/fVr6rdBLQKOk7vQEHZp509Q6UtXTCSMwVNwc" alt="Image" width="252" height="252" loading="lazy">
<em>x,y after</em></p>
<p>If a value on the main diagonal of <em>D</em> is zero, <strong>the corresponding term does not appear in the decomposition of <em>M</em></strong>.</p>
<blockquote>
<p>In this way, we see that the <em>rank</em> of <em>M</em>, which is the dimension of the image of the linear transformation, is equal to the number of non-zero values. — <a target="_blank" href="http://www.ams.org/samplings/feature-column/fcarc-svd">David Austin</a></p>
</blockquote>
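<p>As a quick check of this idea, the sketch below decomposes <em>M''</em> with NumPy. The variable names are illustrative, and the explicit tolerance passed to <code>matrix_rank</code> is an assumption chosen only to make the floating-point comparison robust.</p>

```python
import numpy as np

# M'' from the example above: the second row is twice the first.
M2 = np.array([[1.0, 1.0],
               [2.0, 2.0]])

U, s, Vt = np.linalg.svd(M2)

# The second singular value is zero (up to floating-point error).
print(np.round(s, 3))

# With a single non-zero singular value, the rank is 1.
rank = np.linalg.matrix_rank(M2, tol=1e-10)
print(rank)

# Rebuilding M'' from the first singular triplet alone is already exact.
rank_one = s[0] * np.outer(U[:, 0], Vt[0, :])
print(np.allclose(rank_one, M2))
```

<p>Because the second singular value vanishes, the second pair of singular vectors contributes nothing to the product, which is exactly what the grid picture shows: every point is mapped onto a single line.</p>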
<p>What LSI does is to change the dimensionality of the terms.</p>
<blockquote>
<p>In the original matrix terms are k-dimensional (k is the number of documents). The new space has lower dimensionality, so the dimensions are now groups of terms that tend to co-occur in the same documents. — <a target="_blank" href="https://www.csee.umbc.edu/~ian/irF02/lectures/12LSI.pdf">Ian Soboroff</a></p>
</blockquote>
<p>Now we can go back to the example. Let’s create a space with two dimensions. For this we will use only two values of the diagonal matrix <em>D</em>:</p>
<pre><code># D2 Matrix
</code></pre><pre><code><span class="hljs-number">2.672</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span>
<span class="hljs-number">0.000</span> <span class="hljs-number">1.983</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span>
<span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span>
<span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span>
<span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span>
<span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span> <span class="hljs-number">0.000</span>
</code></pre><p>As <a target="_blank" href="http://webhome.cs.uvic.ca/~thomo/">Alex Thomo</a> explains in <a target="_blank" href="http://webhome.cs.uvic.ca/~thomo/svd.pdf">this tutorial</a>, <strong>terms</strong> are represented by the row vectors of <em>U2 x D2</em> (<em>U2</em> is <em>U</em> with only 2 dimensions) and <strong>documents</strong> are represented by the column vectors of <em>D2 x V2*</em> (<em>V2*</em> is <em>V*</em> with only 2 dimensions). We multiply by <em>D2</em> because <em>D</em> is the diagonal matrix that gives relative importance of dimensions, remember?</p>
<p>Then we calculate the coordinates of each term and each document through these multiplications. The result is:</p>
<pre><code>human     = (<span class="hljs-number">-1.031</span>, <span class="hljs-number">0.440</span>)
interface = (<span class="hljs-number">-0.318</span>, <span class="hljs-number">0.109</span>)
computer  = (<span class="hljs-number">-0.922</span>, <span class="hljs-number">-0.123</span>)
user      = (<span class="hljs-number">-0.604</span>, <span class="hljs-number">-0.232</span>)
system    = (<span class="hljs-number">-2.031</span>, <span class="hljs-number">0.432</span>)
survey    = (<span class="hljs-number">-0.759</span>, <span class="hljs-number">-0.988</span>)
trees     = (<span class="hljs-number">-0.035</span>, <span class="hljs-number">-0.637</span>)
graph     = (<span class="hljs-number">-0.184</span>, <span class="hljs-number">-1.231</span>)
minors    = (<span class="hljs-number">-0.152</span>, <span class="hljs-number">-0.758</span>)
</code></pre><pre><code>c1        = (<span class="hljs-number">-0.850</span>, <span class="hljs-number">0.214</span>)
c2        = (<span class="hljs-number">-1.614</span>, <span class="hljs-number">-0.458</span>)
c3        = (<span class="hljs-number">-1.905</span>, <span class="hljs-number">0.658</span>)
m1        = (<span class="hljs-number">-0.013</span>, <span class="hljs-number">-0.321</span>)
m2        = (<span class="hljs-number">-0.083</span>, <span class="hljs-number">-0.942</span>)
m3        = (<span class="hljs-number">-0.409</span>, <span class="hljs-number">-1.501</span>)
</code></pre><p>Using matplotlib to visualize this, we have:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/lbXXYC51B2DdA6H7syhguOtJaRL2P7hDUznE" alt="Image" width="776" height="504" loading="lazy">
<em>The result for each term and each document</em></p>
<p>Cool, right? The vectors in red are Human-Computer Interaction documents and the blue ones are Graph Theory documents.</p>
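<p>The two-dimensional coordinates can be reproduced with a short NumPy sketch. It assumes the same term-document matrix as the example; the names <code>term_coords</code> and <code>doc_coords</code> are illustrative and do not come from the original notebook. NumPy may return some singular vectors with the opposite sign, which mirrors an axis but leaves the relative positions of terms and documents unchanged.</p>

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system",
         "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "m1", "m2", "m3"]

# Term-document count matrix from the example (rows: terms, columns: documents).
M = np.array([
    [1, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 2, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2  # number of dimensions to keep

# Terms: row vectors of U2 x D2 (each kept column of U scaled by its singular value).
term_coords = U[:, :k] * s[:k]

# Documents: column vectors of D2 x V2*, transposed so that each row is one document.
doc_coords = (s[:k, np.newaxis] * Vt[:k, :]).T

for name, (x, y) in zip(terms + docs, np.vstack([term_coords, doc_coords])):
    print(f"{name:10s} ({x: .3f}, {y: .3f})")
```

<p>Keeping only the first <code>k</code> columns of <em>U</em> and the first <code>k</code> rows of <em>V*</em> is equivalent to multiplying by <em>D2</em>, since every other diagonal entry of <em>D2</em> is zero.</p>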
<p>What about the choice of the number of dimensions?</p>
<blockquote>
<p>The number of dimensions retained in LSI is an empirical issue. Because the underlying principle is that the original data should not be perfectly regenerated but, rather, an optimal dimensionality should be found that will cause correct induction of underlying relations, the customary factor-analytic approach of choosing a dimensionality that most parsimoniously represent the true variance of the original data is not appropriate. — <a target="_blank" href="http://lsa.colorado.edu/papers/dp1.LSAintro.pdf">Source</a></p>
</blockquote>
<p>The measure of similarity computed in the reduced dimensional space is usually, but not always, the cosine between vectors.</p>
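<p>To make the cosine measure concrete, here is a small sketch that compares three of the document vectors from the example. The helper <code>cosine_similarity</code> is a hypothetical name written for this illustration; it is not a function from gensim or from the original notebook.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two-dimensional document coordinates taken from the example above.
c1 = (-0.850, 0.214)
c3 = (-1.905, 0.658)
m2 = (-0.083, -0.942)

same_topic = cosine_similarity(c1, c3)   # both Human-Computer Interaction
cross_topic = cosine_similarity(c1, m2)  # HCI versus Graph Theory

print(f"c1 vs c3: {same_topic:.3f}")
print(f"c1 vs m2: {cross_topic:.3f}")
```

<p>The two Human-Computer Interaction documents point in almost the same direction, so their cosine is close to 1, while the cosine between <em>c1</em> and the Graph Theory document <em>m2</em> is much lower. The cosine ignores vector length, which is why <em>c1</em> and <em>c3</em> count as similar even though <em>c3</em> is much longer.</p>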
<p>And now we can go back to the dataset of theses from the universities. I used the <a target="_blank" href="https://radimrehurek.com/gensim/models/lsimodel.html">LSI model</a> from <a target="_blank" href="https://radimrehurek.com/gensim/index.html">gensim</a>. I did not find many differences between the works of the universities (all seemed to belong to the same cluster). The topic that most differentiated the works of the universities was this one:</p>
<pre><code>y topic:[(<span class="hljs-string">'object'</span>, <span class="hljs-number">0.29383227033104375</span>), (<span class="hljs-string">'software'</span>, <span class="hljs-number">-0.22197520420133632</span>), (<span class="hljs-string">'algorithm'</span>, <span class="hljs-number">0.20537550622495102</span>), (<span class="hljs-string">'robot'</span>, <span class="hljs-number">0.18498675015157251</span>), (<span class="hljs-string">'model'</span>, <span class="hljs-number">-0.17565360130127983</span>), (<span class="hljs-string">'project'</span>, <span class="hljs-number">-0.164945961528315</span>), (<span class="hljs-string">'busines'</span>, <span class="hljs-number">-0.15603883815175643</span>), (<span class="hljs-string">'management'</span>, <span class="hljs-number">-0.15160458583774569</span>), (<span class="hljs-string">'process'</span>, <span class="hljs-number">-0.13630070297362168</span>), (<span class="hljs-string">'visual'</span>, <span class="hljs-number">0.12762128292042879</span>)]
</code></pre><p>Visually we have:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/qGWt-ux3yZvGy6Oii3d39jkdaSEe03EKj3qF" alt="Image" width="607" height="578" loading="lazy">
<em>Visualization for topic y</em></p>
<p>In the image the <em>y</em> topic is on the y-axis. We can see that Carnegie Mellon theses are more associated with <strong>‘object’, ‘robot’, and ‘algorithm’</strong> and the theses from UFPE are more associated with <strong>‘software’, ‘project’, and ‘business’</strong>.</p>
<p>You can see the <strong>Jupyter notebook with the code</strong> <a target="_blank" href="https://github.com/dmesquita/tdcfloripa2017/blob/master/create_clusters.ipynb">here</a>.</p>
<h3 id="heading-step-2-investigating-the-nature-of-the-works">Step 2 — Investigating the nature of the works</h3>
<p>I always had the impression that students in Brazil produce many theses that are literature reviews, while students at the other universities produce few theses like this. To check, I analyzed the titles of the theses.</p>
<p>Usually when a thesis is a literature review the word ‘study’ appears in the title. I then took all the titles of all the theses and checked the words that appear the most, for each University. The results were:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/3m9h8rxe9TqywOoZL27WNJ3hIJLdQQPn7GNy" alt="Image" width="738" height="578" loading="lazy">
<em>Words from titles of UFPE</em></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/qz3qYBjfO9zaLmgzesQH3Hh8LsaAYeTCEwFZ" alt="Image" width="738" height="578" loading="lazy">
<em>Words from titles of Carnegie Mellon</em></p>
<p>You can see the <strong>Jupyter notebook with the code</strong> <a target="_blank" href="https://github.com/dmesquita/tdcfloripa2017/blob/master/analyze_titles.ipynb">here</a>.</p>
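<p>The title analysis boils down to counting words. The sketch below shows one way to do it with the standard library; the function <code>top_title_words</code>, the sample titles, and the stopword list are all made up for illustration and are not taken from the dataset or the notebook. It also assumes English titles, since the regular expression keeps only unaccented letters.</p>

```python
from collections import Counter
import re


def top_title_words(titles, stopwords, n=5):
    """Return the n most frequent words across a list of titles."""
    words = []
    for title in titles:
        # Lowercase the title and keep alphabetic tokens only.
        tokens = re.findall(r"[a-z]+", title.lower())
        words.extend(t for t in tokens if t not in stopwords)
    return Counter(words).most_common(n)


# Hypothetical titles, invented for this example.
titles = [
    "A Study of Agile Practices in Software Projects",
    "A Comparative Study of Graph Databases",
    "Planning Algorithms for a Mobile Robot",
]
stopwords = {"a", "of", "in", "for", "the"}

top_words = top_title_words(titles, stopwords, n=3)
print(top_words)
```

<p>Running the same function once per university gives lists that can be compared directly, for example to see how often ‘study’ shows up in each one.</p>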
<h3 id="heading-findings">Findings</h3>
<p>The sense I got from this simple analysis was that the themes of the works did not differ much. But it was possible to visualize what seem to be the specialties of each institution. The Federal University of Pernambuco produces more work related to <strong>projects and business</strong>, and Carnegie Mellon produces more work related to <strong>robots and algorithms</strong>. In my view, this difference is not a bad thing; it simply shows that each university is specialized in certain areas.</p>
<p>A takeaway was that in Brazil we need to produce different kinds of work instead of just doing literature reviews.</p>
<p>Something important that I realized while doing the analysis (and that did not come from the findings themselves) was that having the best thesis is not enough. I started the analysis trying to identify <em>why they produce better works than us</em> and what we can do to <em>get there</em> and <em>become more well known.</em> But I felt that maybe one way to <em>get there</em> is simply to show more of our work and to exchange more knowledge with them. This can push us to produce more relevant articles and to improve with feedback.</p>
<p>I also think that this applies to everyone, university students and professionals alike. This quote sums it up well:</p>
<blockquote>
<p>“It’s not enough to be good. In order to be found, you have to be findable.” — <a target="_blank" href="https://www.goodreads.com/work/quotes/25771145-show-your-work-10-ways-to-share-your-creativity-and-get-discovered">Austin Kleon</a></p>
</blockquote>
<p>And that’s it, thank you for reading!</p>
<p><em>If you found this article helpful, it would mean a lot if you shared it with friends. Follow me for more articles about Data Science and Machine Learning.</em></p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
