<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[MintyU | Feed]]></title><description><![CDATA[Software Development Blog]]></description><link>https://mintyu.github.io</link><generator>GatsbyJS</generator><lastBuildDate>Tue, 04 Oct 2022 11:46:27 GMT</lastBuildDate><item><title><![CDATA[TCP와 QUIC 파일 전송 성능 비교 (3)]]></title><description><![CDATA[6. 실험 결과 근거리 네트워킹 실험 근거리 네트워킹 실험에서는 동일한 기기 상에서 서버와 클라이언트를 각각 실행한 환경에서 파일을 전송 후 성능을 측정했습니다. 해당 실험 환경의 Round Trip Time은 min / avg / max…]]></description><link>https://mintyu.github.io/capstone-quic_03/</link><guid isPermaLink="false">https://mintyu.github.io/capstone-quic_03/</guid><pubDate>Mon, 27 Jun 2022 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;6-실험-결과&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#6-%EC%8B%A4%ED%97%98-%EA%B2%B0%EA%B3%BC&quot; aria-label=&quot;6 실험 결과 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;6. 실험 결과&lt;/h2&gt;
&lt;h3 id=&quot;근거리-네트워킹-실험&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EA%B7%BC%EA%B1%B0%EB%A6%AC-%EB%84%A4%ED%8A%B8%EC%9B%8C%ED%82%B9-%EC%8B%A4%ED%97%98&quot; aria-label=&quot;근거리 네트워킹 실험 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Local Network Experiment&lt;/h3&gt;
&lt;p&gt;In the local network experiment, the server and the client ran on the same machine, and performance was measured by transferring files between them.&lt;/p&gt;
&lt;p&gt;The Round Trip Time of this environment was min / avg / max / stddev = 0.037 / 0.079 / 0.149 / 0.026 ms.&lt;/p&gt;
&lt;p&gt;For each of TCP, TCP+TLS, and QUIC, we scaled the number of parallel connections from 1 to 6. For every case, the file was transferred 10 times and the average transfer time was computed.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/6f674bd1438a1c0df265d0ab87cfb2d5/2cefc/1.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 67.56756756756756%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAOCAYAAAAvxDzwAAAACXBIWXMAABYlAAAWJQFJUiTwAAABoUlEQVQ4y5VTi27DIAzM/3/lHtrSJmlePEOAm21K01bVplGdahtjfIfT5JwBMMpafETX9xiGAdZaiZmgxWfs+y4xpZT4XddhXVeJca0mp4RxXnFxiSOIMcI6D++9HOakPQbxGbzPsRBKzDkndqI6pSB3SOhcxpvKcBEPK8n+4Zf0jOfFsVtBvoFP+ZjxTkVbk6FDLsXwPzTcqtYaMRXK1JPQ/6ainzrhQyUM5C9bwewP+9nnJpqHtq+UDroA5WH0IEkKTvawz3c+2+MGNPxS8zwL7aoNd51S0aR2/YhXdEu84Sdnygy1KgQfZL+KXH5Px3N5LEEq/5XYjfIYO0zbgF6dcJq+ZLZ41pz1cknwu/x7GiljjOxxTs2z1mDbNn7lJAGZv0iJcabCPfqtRe+/0ekW3drKRYxZT3SwzGNMV5C9uSAXNvhj+Wxg8gqbNUFhiReM+xkXRjhhCK34Kk1wlCtzWKc/UbcVVcNX+t/0FRTf7EQ76lKQKfOjsDb8/S7LcrOrTlorKMKxV3Rc1kXyWGur7WvK959WtadpkkK/5yb8AH7vSd2Al248AAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;근거리&quot;
        title=&quot;근거리&quot;
        src=&quot;/static/6f674bd1438a1c0df265d0ab87cfb2d5/fcda8/1.png&quot;
        srcset=&quot;/static/6f674bd1438a1c0df265d0ab87cfb2d5/12f09/1.png 148w,
/static/6f674bd1438a1c0df265d0ab87cfb2d5/e4a3f/1.png 295w,
/static/6f674bd1438a1c0df265d0ab87cfb2d5/fcda8/1.png 590w,
/static/6f674bd1438a1c0df265d0ab87cfb2d5/efc66/1.png 885w,
/static/6f674bd1438a1c0df265d0ab87cfb2d5/c83ae/1.png 1180w,
/static/6f674bd1438a1c0df265d0ab87cfb2d5/2cefc/1.png 1400w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The result: QUIC was slower. Plain TCP was the fastest, and even TCP+TLS, which pays an additional delay for the TLS handshake, still beat QUIC.&lt;/p&gt;
&lt;p&gt;Because the experiment ran in an environment with an extremely small RTT, the issues usually held against TCP, namely the 3-Way Handshake and HOL Blocking, barely surfaced here. With such a tiny RTT, establishing a connection via the 3-Way Handshake costs almost nothing, and since packet loss was exactly 0%, HOL Blocking never occurred either.&lt;/p&gt;
&lt;p&gt;Moreover, because connection setup was so cheap in the local network experiment, we could even observe the transfer time shrinking as we added connections for parallel file transfer.&lt;/p&gt;
&lt;h3 id=&quot;원거리-네트워킹-실험&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EC%9B%90%EA%B1%B0%EB%A6%AC-%EB%84%A4%ED%8A%B8%EC%9B%8C%ED%82%B9-%EC%8B%A4%ED%97%98&quot; aria-label=&quot;원거리 네트워킹 실험 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Long-Distance Network Experiment&lt;/h3&gt;
&lt;p&gt;Now consider the opposite case: receiving data from an overseas server, where the RTT is very large. To build the long-distance experiment environment, we used AWS EC2.&lt;/p&gt;
&lt;p&gt;The server ran in the us-east-1 region (US East, N. Virginia) on a t2.medium instance with 2 vCPUs and 4 GiB of memory.&lt;/p&gt;
&lt;p&gt;The Round Trip Time of this environment was min / avg / max / stddev = 194.567 / 195.514 / 208.020 / 1.646 ms.&lt;/p&gt;
&lt;p&gt;As before, we scaled each of TCP, TCP+TLS, and QUIC from 1 to 6 parallel connections, transferring 10 times per case and averaging the transfer times.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/15fc995cb7bfd0a3813f84e570da8c33/2cefc/2.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 67.56756756756756%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAOCAYAAAAvxDzwAAAACXBIWXMAABYlAAAWJQFJUiTwAAABfUlEQVQ4y41TiZaDIAzs///k9tit1ipVQQUhzBIs1qp70DdmSOLQmHDw3mNaHi48c9HglucQQkSvxYhS3IPvhr7vo49tWZYoiiJaIpoUgtYhvmRtdBIFUecipv0LztGO7z13FmSSgAX/C3trFvzPWucNwxDLH8cRxphYaRTkzV6ZayRRFmqaBkqpyKWSaGWLQQ+T4PL030rTWqNuanRdNx+wKZlP4ITwKsgTN3tVJzBqC9kqdCoIOZr90zcPcTKQtoam8A/5G0gp5yTjNAqZIXt8IQ/IxCeKJkNjBIS94W6uKE0WkEfOEGOBxlbonXqVPJKOgWK44qHvkK6OaN0Dimr0JGMFfue36bIeB1zVBZ1VWMz5BjynP8Fx49IcGutwDlV/KgTr8dESTpJwVoSjnMB8ivnAJxzD/hRwefJyCILLq+djU3xq+7PDPnXgLbbhz/0hjQM3hrvN4BljmxrG4JljtG07x5izj3myb1cvreWMJV5VVbwA67zlni/HN24mSyU8DGZ8AAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;원거리&quot;
        title=&quot;원거리&quot;
        src=&quot;/static/15fc995cb7bfd0a3813f84e570da8c33/fcda8/2.png&quot;
        srcset=&quot;/static/15fc995cb7bfd0a3813f84e570da8c33/12f09/2.png 148w,
/static/15fc995cb7bfd0a3813f84e570da8c33/e4a3f/2.png 295w,
/static/15fc995cb7bfd0a3813f84e570da8c33/fcda8/2.png 590w,
/static/15fc995cb7bfd0a3813f84e570da8c33/efc66/2.png 885w,
/static/15fc995cb7bfd0a3813f84e570da8c33/c83ae/2.png 1180w,
/static/15fc995cb7bfd0a3813f84e570da8c33/2cefc/2.png 1400w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;In the long-distance environment the result flipped: QUIC transferred the file faster than TCP.&lt;/p&gt;
&lt;p&gt;For a closer analysis, we measured the connection-establishment portion of the total transfer time separately. The graph below shows the result.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/f0bccdbd4be85d66e415f717ff90917b/2cefc/3.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 67.56756756756756%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAOCAYAAAAvxDzwAAAACXBIWXMAABYlAAAWJQFJUiTwAAAB8ElEQVQ4y41Ua28bIRD0//9j/dBPVdUkqhQ5thPHx70POJ7LdIFz4lRRU+QRLCxzswN4l1JCbsSYtEPfd2jbFuM4wpFBP3AsWlhroZRC01w4ZwAFjXnqIHhNCIGu65C5dpnMe48YCS7EsjHDOYdIAdbVOMZY8owxHDtQDJxjObZve4ioEmbmq9IP7ZOpfy+xwkyUv34l/V/QhkqdYGPCYJkwy5znucgtidx/hZSous69dIQXRTjKhNFuHn7V0l8lBg56CxwV8KwBHW5Kzuqy0Z+WRCkbXOm4lz7hpBIOrKYzBB83pdjK599OSvmBMJMUok2XY2/ESkxCOGsqpCglv8uvAggm6veS043BGYo3nnXisrLZtzXXdUMrVJywhAG957vpXyHjWA9FLkv1hqo3Bwm8sjfSV4qIwBsXzAzhBYRrMIapxJJVcRGg7YPl2jgfsBcX3D0f8Pvlgcd3aMd7iOkH9pdvODTfcWl/oml/Ye4esfR7qOEIPZwg+ycsw6HMyfFUS45MepRP6PQ9pH2E9g2UPSO6DuQGJAYCy2ZFyXPsR1BBHvclLvNhroTWGqx6xWo81jVALQZm9SWWyjDYL20KruPVOO4t5kWXsV455tzt6X08lEj8crZTSEQlYZomaP5zyOvpesrlVsS3A88v7g/sIkazTItsUgAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;연결 시간&quot;
        title=&quot;연결 시간&quot;
        src=&quot;/static/f0bccdbd4be85d66e415f717ff90917b/fcda8/3.png&quot;
        srcset=&quot;/static/f0bccdbd4be85d66e415f717ff90917b/12f09/3.png 148w,
/static/f0bccdbd4be85d66e415f717ff90917b/e4a3f/3.png 295w,
/static/f0bccdbd4be85d66e415f717ff90917b/fcda8/3.png 590w,
/static/f0bccdbd4be85d66e415f717ff90917b/efc66/3.png 885w,
/static/f0bccdbd4be85d66e415f717ff90917b/c83ae/3.png 1180w,
/static/f0bccdbd4be85d66e415f717ff90917b/2cefc/3.png 1400w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;As the number of TCP connections grew, the time spent establishing them kept increasing. In contrast, adding QUIC Streams after the initial connection caused no noticeable increase in connection time.&lt;/p&gt;
&lt;p&gt;What is more, from the second transfer onward QUIC can attempt to send data immediately with 0-RTT; the graph shows the connection time dropping even further accordingly.&lt;/p&gt;
&lt;h3 id=&quot;요약&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EC%9A%94%EC%95%BD&quot; aria-label=&quot;요약 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Summary&lt;/h3&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/8870f1ad8d9ba06109b21bcc128a456a/2cefc/4.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 67.56756756756756%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAOCAYAAAAvxDzwAAAACXBIWXMAABYlAAAWJQFJUiTwAAABkElEQVQ4y61T227bMAzN///ZsDZp2gFDX4bWjuOkcayLJetm+YyUk64ohj0slXAkkpKPeRFX8zyDR0gT9DDAWgulFEZnYZzBQLYrnHPQWpc7rMdxhDEG5qKnlLBishgjfRzQC0FkEl3XQRkJZRVEL9D3PQSd6UHj3J8hpSw2r1X5uZCi2GOIC2HOmdbF01tGplkIS9jzwhmnCBUkhqjfoYOCiQM694bavKCxFfa2pr3Gzr5e5Apv7rAQjpSLgUjascLRNXSwR+cPBSff4kT6ybUwSWHKESkH2kPZU/bvNsxXD2k+nw/41SvspEcjJ7QqFewJjWR5IjmiFh4131Gh3K3EH7nVHiuuElePw53yjDRlCnv6K/jsn6BaFEIOmfN4K2hZQv7KsWJm7ymxFBI/n1swXz1ksi/z8PoODbWOpg4w3FpmwMi5tVQwaitHOXaj/YAR/hP4XqBICyGHfHr+iePjBsenLar779hvN2gfH9Bs13i96Aw+q9d3Rd5t7vFy943kNRq2/Xj60Cn/OeZP8m96WEFODvPjgwAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;요약&quot;
        title=&quot;요약&quot;
        src=&quot;/static/8870f1ad8d9ba06109b21bcc128a456a/fcda8/4.png&quot;
        srcset=&quot;/static/8870f1ad8d9ba06109b21bcc128a456a/12f09/4.png 148w,
/static/8870f1ad8d9ba06109b21bcc128a456a/e4a3f/4.png 295w,
/static/8870f1ad8d9ba06109b21bcc128a456a/fcda8/4.png 590w,
/static/8870f1ad8d9ba06109b21bcc128a456a/efc66/4.png 885w,
/static/8870f1ad8d9ba06109b21bcc128a456a/c83ae/4.png 1180w,
/static/8870f1ad8d9ba06109b21bcc128a456a/2cefc/4.png 1400w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;When the RTT is very small, TCP transfers faster and is therefore preferable; but once the RTT grows (the situation for most real services), &lt;strong&gt;file transfer over QUIC can be the better choice&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id=&quot;7-결론&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#7-%EA%B2%B0%EB%A1%A0&quot; aria-label=&quot;7 결론 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;7. Conclusion&lt;/h2&gt;
&lt;p&gt;As the experiments show, TCP can hold an advantage when the RTT is small; but at the RTT levels of most real service environments, QUIC-based file transfer is entirely practical and, in our view, fully competitive.&lt;/p&gt;
&lt;p&gt;QUIC has not yet fully settled in: optimization is still lacking, and above all there has not been time for a sufficient body of documentation and community knowledge to accumulate. Even so, the experiments suggest it is competitive enough to replace TCP.&lt;/p&gt;
&lt;p&gt;In particular, when parallel transfer is needed, a TCP connection supports only 1:1 communication, so multiple connections must be established and transferred over in parallel, and each additional connection adds overhead. QUIC, by contrast, can add Streams with virtually no extra setup cost; our experiments confirmed that this gives QUIC an overwhelming advantage over TCP in workloads that require parallel transfer.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Comparing TCP and QUIC File Transfer Performance (2)]]></title><description><![CDATA[4. Introducing QUIC. As explained in the previous post, QUIC is a transport-layer protocol built on top of UDP. Still, the fact that UDP-based QUIC took over the spot HTTP once reserved for TCP may sound ironic given the nature of UDP…]]></description><link>https://mintyu.github.io/capstone-quic_02/</link><guid isPermaLink="false">https://mintyu.github.io/capstone-quic_02/</guid><pubDate>Sat, 25 Jun 2022 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;4-quic-소개&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#4-quic-%EC%86%8C%EA%B0%9C&quot; aria-label=&quot;4 quic 소개 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;4. Introducing QUIC&lt;/h2&gt;
&lt;p&gt;As explained in the previous post, QUIC is a transport-layer protocol built on top of UDP. Yet the fact that UDP-based QUIC took over the spot HTTP once reserved for TCP may sound somewhat ironic if you are familiar with UDP's characteristics.&lt;/p&gt;
&lt;p&gt;UDP by itself cannot guarantee reliable delivery, but QUIC provides reliability by adding a new layer on top of UDP. This layer supplies what used to be TCP's job, namely packet retransmission, congestion control, and rate adjustment, securing reliability even over UDP-based communication. On top of that, whereas TCP is implemented inside the kernel, QUIC's congestion-control algorithm is implemented in the application, which makes it easy to modify and convenient to optimize.&lt;/p&gt;
&lt;h3 id=&quot;quic의-특징&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#quic%EC%9D%98-%ED%8A%B9%EC%A7%95&quot; aria-label=&quot;quic의 특징 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Characteristics of QUIC&lt;/h3&gt;
&lt;p&gt;QUIC clearly took pains to resolve the points long criticized in TCP. The first is connection establishment. Whereas TCP must complete a 3-Way Handshake to establish a connection, QUIC establishes one in 1-RTT (one Round Trip Time); for a client that has connected to the server before, the connection is established in 0-RTT, and in theory data transfer can begin immediately without waiting for the Handshake to finish. This dramatically reduces the time needed to establish a connection.&lt;/p&gt;
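&lt;p&gt;As a rough back-of-the-envelope illustration (my own sketch, not from the original experiment), assume the commonly cited handshake round counts before application data can flow: 1 RTT for the TCP 3-Way Handshake, plus 2 RTT for TLS 1.2 or 1 RTT for TLS 1.3, versus 1 RTT for a fresh QUIC connection and 0 on resumption. The minimum connection-setup cost then scales directly with the RTT:&lt;/p&gt;

```go
package main

import "fmt"

func main() {
	// Hypothetical RTT in ms; with ~195 ms (as in this series'
	// long-distance experiment) every extra round trip is costly.
	rtt := 195.514

	setups := []struct {
		name string
		rtts float64 // round trips before application data can flow
	}{
		{"TCP", 1},          // 3-way handshake
		{"TCP+TLS 1.2", 3},  // 3-way handshake + 2-RTT TLS handshake
		{"TCP+TLS 1.3", 2},  // 3-way handshake + 1-RTT TLS handshake
		{"QUIC (fresh)", 1}, // transport and TLS 1.3 handshakes combined
		{"QUIC (0-RTT)", 0}, // resumed connection, data sent immediately
	}
	for _, s := range setups {
		fmt.Printf("%-13s ~%5.0f ms setup\n", s.name, s.rtts*rtt)
	}
}
```

&lt;p&gt;The round counts are idealized minimums and ignore retransmissions and server processing time, but they show why shaving handshake round trips matters more as the RTT grows.&lt;/p&gt;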
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/a1d7e443d6b493c707e5261876d31f2d/ec3e2/1.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 54.05405405405405%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAALCAIAAADwazoUAAAACXBIWXMAAAsSAAALEgHS3X78AAABs0lEQVQoz01S246bMBTk/7+lD9n9gL0paba5PCxIq0ZNAxENFxtjDL4BATqYrdQROjrmzByPfex1DtM0RVG03W6Px6Pv++/vP3a73eGw3+8PT89P6816vfkeBME4TXeHvu/HcfQgg5gQgsXkaoMD57yRjdZ6IYwOf+I4J2ThtG3rNVKhjE6Idd1YrX7H2asfzY2GAbyXl+c4jpWSBSWMlULUgBACHb1vb5/9fQAVjR8fHqLwwrj4vCTO211rs95s64q9fkQfIZ0csixdrVbWWi8lTOrW2E6Z9vTr3DSNsYaWnPAmL2vCa1rV1po4yTMmurbDTjmhfhDArJcxRrhIGU+KipQCJ1FKJUVGBaNVQSqas9wAWg/jaLs7vq4fuq6fbUulsLE0yva2lgJmlNYJTfOSZCyfYzGL0XH6wnyv1pj5wlBYDoxotJnFSmVZ1jjgziml4Egpp/8AzixGxooiDMOlN8Sg4jwY+8/TCaNCjri8hZKxJElGN50vMWq8qiCDAeP8YIwYz+USLr6WFkjSND2fz/gJp7MYLUHFArNBvrye2+12vV4RC8ZgeKkub+P+D/j5F3bJbXv+oBDiAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;1&quot;
        title=&quot;1&quot;
        src=&quot;/static/a1d7e443d6b493c707e5261876d31f2d/fcda8/1.png&quot;
        srcset=&quot;/static/a1d7e443d6b493c707e5261876d31f2d/12f09/1.png 148w,
/static/a1d7e443d6b493c707e5261876d31f2d/e4a3f/1.png 295w,
/static/a1d7e443d6b493c707e5261876d31f2d/fcda8/1.png 590w,
/static/a1d7e443d6b493c707e5261876d31f2d/efc66/1.png 885w,
/static/a1d7e443d6b493c707e5261876d31f2d/ec3e2/1.png 997w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The second point that stands out is that QUIC solves HOL Blocking, the other problem cited against TCP alongside the 3-Way Handshake. Most browsers using TCP-based HTTP/2 run tens or hundreds of parallel transfers over a single TCP connection. QUIC, however, can open multiple independent Streams and handle each separately, so a loss of data on one Stream does not affect transfers on the other Streams. QUIC's performance therefore stands out all the more in environments with a high packet loss rate.&lt;/p&gt;
&lt;p&gt;The third notable feature is TLS (Transport Layer Security). TCP has no security features of its own, so a secure connection always required TLS on top, and establishing the TLS session demanded an extra Handshake that made TCP's connection latency even more pronounced. QUIC, by contrast, ships with TLS 1.3 transport security built in. TLS 1.3 needs fewer Handshake RTTs than earlier versions, so it is faster to begin with; and instead of performing the TCP connection and the TLS connection sequentially as before, QUIC carries out the TLS Handshake as part of connection establishment itself.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/12ea13caf4287246894ab72346871b7e/2e195/2.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 78.37837837837837%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAQCAYAAAAWGF8bAAAACXBIWXMAAAsSAAALEgHS3X78AAAC30lEQVQ4y2VUa2/aQBDkT/e3VMpPSKXyoaUibWiipm0e0KQEAsFgTDDYxtjG77c93TtjJ1JPsizf3M7Ozu65lSQJhOcJXM8DW3kBLJYSHNtGEASYTSdIs4JjQRhDJCwMQzpXYCYskEQhxzzHhrRcolWUQGTvkO1nKJ01itBCSN85MTMstLbIDcI8BXnkIHRNFAygddEfoXM9xmxjYO9G8OMULZ+U5RwvkAcGcvkaqdCpsnouuLYiQ+5TktUFMumSYw4pQplDN0zcjgS8e99Gtz9Fi5eiL+AL5yR1B8QmDyiKnL/97Qjp+jdhOj172ikbhecjGZ/vnvFjKODbzQO2qlYRHowd5tNHiNIaq/UGsiyjXnt1DWF8j+VKhqLp0PV9g/W63zG4/AV/pwFpwlSgZdsHMjuC43rYbjb4enaGdrvNAyyL/IwTGLaDzXaDD6en6Ha7VaL9kThN4SsqFp862E+OJQ+HQ5ycnODl5YV3kJecVyUPr29w9vEjVMLKo7Ia0+4fYE6f4WsaEpf8Zgp93+cjwMwXRRG9Xg9tIqia4pGAFDmpT00L2u0dlP6AYyXFJDRW5ZG88p0Ic9oQBTK18wWKqpKPEpmr8gNRHMPTdZizGTxFgSXM4ZAtdXCtmJGXZQnGxUt2yA/p7xCevEFMfkQ7vTnsU+cYWZGljRIWzB5OWlbfjULTNBGRb0UQIjAsqP0/kK9+vmYG3hDRTcoyvscwZtXb1Sh0SdVuMECizKlrUaOiOlSwaJrPA38Xx/IKrpJ8JAutsErCFdbscZ4hNNjddClzeSQsuaow8BBpY6RxUA11+TrcS93B1VxHkuU8psX9YCNAZguPT5ClFZIkbnxiS5ovMH+aQic/Wddr/1j54mQCcTyCczi8lhzTHyekxjiLBSnZ8VmsS2NYQHsH6nREZxjG9jMKZt77axkOiYgPNp+KVq2CLYHmsP6NvVXI/BqORkQe/4f5RLpcrZr9f3u9wjhUS/FFAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;2&quot;
        title=&quot;2&quot;
        src=&quot;/static/12ea13caf4287246894ab72346871b7e/fcda8/2.png&quot;
        srcset=&quot;/static/12ea13caf4287246894ab72346871b7e/12f09/2.png 148w,
/static/12ea13caf4287246894ab72346871b7e/e4a3f/2.png 295w,
/static/12ea13caf4287246894ab72346871b7e/fcda8/2.png 590w,
/static/12ea13caf4287246894ab72346871b7e/2e195/2.png 782w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;h2 id=&quot;5-quic이-tcp를-대체-가능할까&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#5-quic%EC%9D%B4-tcp%EB%A5%BC-%EB%8C%80%EC%B2%B4-%EA%B0%80%EB%8A%A5%ED%95%A0%EA%B9%8C&quot; aria-label=&quot;5 quic이 tcp를 대체 가능할까 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;5. Can QUIC Replace TCP?&lt;/h2&gt;
&lt;p&gt;Studying QUIC, it really does feel like it could replace TCP. The chronic problems of TCP are resolved, and QUIC seems likely to be faster as well.&lt;/p&gt;
&lt;p&gt;Let's find out through experiments whether QUIC can truly serve as an alternative to TCP.&lt;/p&gt;
&lt;h3 id=&quot;실험-설계&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EC%8B%A4%ED%97%98-%EC%84%A4%EA%B3%84&quot; aria-label=&quot;실험 설계 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Experiment Design&lt;/h3&gt;
&lt;p&gt;First, we chose &quot;file transfer&quot; as the performance-evaluation task. By transferring files of identical size while varying only the transport protocol and measuring the transfer time, we can gauge each protocol's transfer performance in a simple way.&lt;/p&gt;
&lt;p&gt;The overall framework of the experiment is shown below.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/0224df5bf0e28a8e1d7366aadd8739ff/ae628/3.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 47.2972972972973%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAJCAYAAAAywQxIAAAACXBIWXMAABYlAAAWJQFJUiTwAAACH0lEQVQoz2WQ30vTURjGjyEjITAiQeqqqxAKFJuzokjDHyER5dfcbRfd6lV/QgR1pdVNypohLFpe5NpMjchUuljM2splUk5tujlcm9u+X2fkp3OOgVEPPBfveZ/nOQ+vqDOuc7V/GuP+OEZvAKPnOe33xujom6Sp6yYlQiD+cI9kU/ctOh68kRqp7/Fj3B2R3pdSP4398jVE18ArOmegfQqMkSyGbw3nWzm/gxuvM1RWVLKvvELzUOVh+Zblitx1So3hT2OMmdqrMrofjiPGAn4UYnOzrCzF+JFKsl20wPxA6mMfxvn9GI3lOJsP4Gw5qN/UTmmS8WVWFheIL8zrjIDvGSIQCOhhcWmJRDJJKpUiV1CBc6xEPZyrKaW+SuA4KmistbH6+bHeKU1S6pVHeRWGfb7dwEgkQigUIpvNYllFfuXCJGYHaHXYaKwRNFQL2k6VkYg+YjsXoWBaxGIxgsEg3+PxncDhYcTo6KgeXC4XXq+XcDiMaW7KFl9YjT7hbLVtp6FkQ22ZbOgFa558wWRo6Ckej4eJiYndQL9/54br6+tkMhny+bz8XQbm3rMq73XxtI0Wu6D5hODSmb0kPvWzvTHDZvEnGxtZ7Umn0/83/BvF4hZsLbP2dQR7lY3jR4Rm/bEy1r690LtNpfkH6nxicHBQD6qZZZmSFrlcga10kNnJ27SdLOWCo4TWOqHbRqfu6J3SKK3yKK+C2+3mNxqyIozu2gRYAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;frame&quot;
        title=&quot;frame&quot;
        src=&quot;/static/0224df5bf0e28a8e1d7366aadd8739ff/fcda8/3.png&quot;
        srcset=&quot;/static/0224df5bf0e28a8e1d7366aadd8739ff/12f09/3.png 148w,
/static/0224df5bf0e28a8e1d7366aadd8739ff/e4a3f/3.png 295w,
/static/0224df5bf0e28a8e1d7366aadd8739ff/fcda8/3.png 590w,
/static/0224df5bf0e28a8e1d7366aadd8739ff/efc66/3.png 885w,
/static/0224df5bf0e28a8e1d7366aadd8739ff/c83ae/3.png 1180w,
/static/0224df5bf0e28a8e1d7366aadd8739ff/ae628/3.png 2604w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;For the payload, we used a video file in mov format. If any data is lost and fails to arrive, the video will not play properly, so simply playing the received file gives a quick check that everything transferred correctly.&lt;/p&gt;
&lt;p&gt;Because a TCP connection allows only 1:1 communication, parallel transfer over TCP means establishing multiple connections. To compare against QUIC, which can scale its Stream sessions in parallel over a single connection, we establish several parallel TCP connections, measure whether transfer performance improves, and compare with the QUIC case.&lt;/p&gt;
&lt;p&gt;We also compare performance separately for a local network and a long-distance network, to see how the protocols differ both when transfer is fast and when it is slow.&lt;/p&gt;
&lt;p&gt;The detailed design of the experiment is as follows.&lt;/p&gt;
&lt;h4 id=&quot;--서버-sw-설계-및-개발&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#--%EC%84%9C%EB%B2%84-sw-%EC%84%A4%EA%B3%84-%EB%B0%8F-%EA%B0%9C%EB%B0%9C&quot; aria-label=&quot;  서버 sw 설계 및 개발 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;- Server SW Design and Development&lt;/h4&gt;
&lt;p&gt;The server SW used in the experiment splits the file to be transferred into chunks of uniform size. The number of chunks equals the number of connections established with the client during the experiment. Each chunk is numbered, and exactly one chunk is delivered per connection, with all connections served concurrently via multithreading. For TCP we measure transfers over 1 to 6 pure TCP connections without TLS and over 1 to 6 TCP connections with TLS; for QUIC we likewise scale from 1 to 6 Streams while transferring the file in parts.&lt;/p&gt;
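&lt;p&gt;The chunking step described above can be sketched as follows (a hypothetical helper of my own, not the post's actual code; the real server additionally dispatches each numbered chunk to its own connection):&lt;/p&gt;

```go
package main

import "fmt"

// splitChunks divides data into up to n numbered chunks of near-equal
// size, one chunk per planned connection. The chunk index doubles as
// the chunk number used to reassemble the file on the client side.
func splitChunks(data []byte, n int) [][]byte {
	size := (len(data) + n - 1) / n // ceiling division
	chunks := make([][]byte, 0, n)
	for len(data) > size {
		chunks = append(chunks, data[:size])
		data = data[size:]
	}
	return append(chunks, data) // final, possibly shorter, chunk
}

func main() {
	for i, c := range splitChunks(make([]byte, 10), 3) {
		fmt.Printf("chunk %d: %d bytes\n", i, len(c))
	}
}
```

&lt;p&gt;Ceiling division keeps every chunk within one byte-per-connection of the others, so no single connection becomes the bottleneck.&lt;/p&gt;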
&lt;p&gt;The server SW is written in Golang: QUIC transfer is implemented with the &lt;a href=&quot;https://github.com/lucas-clemente/quic-go&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;github.com/lucas-clemente/quic-go&lt;/a&gt; library, plain TCP with the standard net package, and TCP+TLS with the crypto/tls package. The TCP(+TLS) server creates one TCP(+TLS) connection per chunk and sends each chunk over its connection; chunk order matches the order in which the client initiates connections. When a chunk has been fully sent, the server closes the TCP(+TLS) connection, delivering EOF to the client. The QUIC server likewise creates one stream per chunk and sends each chunk over its stream, with chunk order again matching the client's connection order.&lt;/p&gt;
&lt;p&gt;Once a chunk has been fully sent, the server sends EOF and the QUIC stream closes; when all transfers complete and every stream has closed, the QUIC connection is torn down. A subtle problem arises here: if the server finishes sending all files and closes its streams before all data has reached the client, the QUIC connection drops and the last one or two buffers never arrive. To prevent this premature teardown, the server opens a separate stream for confirming completion, and closes it only after receiving the signal the client sends once it has finished receiving every file.&lt;/p&gt;
&lt;h4 id=&quot;--클라이언트-sw-설계-및-개발&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#--%ED%81%B4%EB%9D%BC%EC%9D%B4%EC%96%B8%ED%8A%B8-sw-%EC%84%A4%EA%B3%84-%EB%B0%8F-%EA%B0%9C%EB%B0%9C&quot; aria-label=&quot;  클라이언트 sw 설계 및 개발 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;- Client SW Design and Development&lt;/h4&gt;
&lt;p&gt;The client connects to the server and receives every chunk the server sends. Because the connection order matches the chunk order, the client numbers each received chunk by its connection order and then reassembles them into the original file. As on the server, reception runs concurrently via multithreading. Every client logs its start time, connection-complete time, and transfer-complete time to measure how long each stage takes.&lt;/p&gt;
&lt;p&gt;The TCP(+TLS) client opens the same number of TCP(+TLS) connections as the server and, on each connection, stops receiving and closes the connection when it gets the EOF sent by the server. Once every chunk has arrived, it merges all received chunks in order into a single file. One caveat: when the TCP+TLS client opens multiple connections to the server, if no data has yet moved over an earlier TCP+TLS connection, a later connection attempt is treated as a duplicate of the earlier one (because the same certificate is used). To mark each attempt as distinct, the TCP+TLS client therefore sends data immediately after each connection is established.&lt;/p&gt;
&lt;p&gt;The QUIC client likewise receives the file chunks over the same number of QUIC streams as the server, plus one additional stream: after it has received EOF on every stream and verified that the whole file arrived intact, it sends a signal telling the server it is safe to close. Once all connections are closed, the received chunks are merged into a single file.&lt;/p&gt;
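The reassembly step on the client can be sketched like this. This is illustrative Python; the original client was written in Go, and `merge_chunks` is a made-up name.

```python
def merge_chunks(chunks):
    # Each connection yields (chunk_index, data). Connection order equals
    # chunk order, so sorting by index restores the original file layout,
    # no matter which connection finished receiving first.
    return b"".join(data for _, data in sorted(chunks))

received = [(2, b"lo"), (0, b"he"), (1, b"l")]  # out of completion order
print(merge_chunks(received))  # b'hello'
```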
&lt;p&gt;That wraps up the experiment design. In the next post, we will look at the experiment results.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Comparing TCP and QUIC File Transfer Performance (1)]]></title><description><![CDATA[0. A Brief Look at Transport Layer Protocols TCP and UDP…]]></description><link>https://mintyu.github.io/capstone-quic_01/</link><guid isPermaLink="false">https://mintyu.github.io/capstone-quic_01/</guid><pubDate>Fri, 24 Jun 2022 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;0-전송-계층-프로토콜의-간단한-이해&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#0-%EC%A0%84%EC%86%A1-%EA%B3%84%EC%B8%B5-%ED%94%84%EB%A1%9C%ED%86%A0%EC%BD%9C%EC%9D%98-%EA%B0%84%EB%8B%A8%ED%95%9C-%EC%9D%B4%ED%95%B4&quot; aria-label=&quot;0 전송 계층 프로토콜의 간단한 이해 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;0. A Brief Look at Transport Layer Protocols&lt;/h2&gt;
&lt;p&gt;Before comparing TCP and UDP, let us first review what a transport layer protocol actually is.&lt;/p&gt;
&lt;p&gt;As computer communication spread, it became far too inefficient for every network to define its own rules. What was needed was a common set of rules that allowed devices from any manufacturer to interoperate on any network. Out of this effort came the concept of the OSI 7 Layer model.&lt;/p&gt;
&lt;p&gt;The OSI 7 Layer model divides the communication process into seven stages and defines the role each stage plays.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/bc5c6ba3c4c105e14ca8d3481f030e3c/f43e4/1.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 81.75675675675677%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAQCAYAAAAWGF8bAAAACXBIWXMAABYlAAAWJQFJUiTwAAADGklEQVQ4y32U2U8TURTG+58ZX/TFBx8wmmiMiZpokAeVRKMIKgiIikqIgkTQspTFjVI1WOABYuK+sInYbTprZ1pqS/fp9POeO04lJjrJL+eem7lnvnO+27pKJRNH6+5g155WHDrWhQNHbmH/4VuoOXAdO3e3Ymk5iHA4iFAohEAgwNZhhIIhBIP2HuWCIMA0TdDjKpfL8M99wVPfG0y9fA/vi3ecZ763jHfQYnHEYjEYhoFEIoF4PM7Z2NiApml8n8hkMnZBUnixbRS1p+/hTOMgw436hgc4dW4Ax0/24dtaGKoiIxqNVqEPUFQUBYZuQFVVZLNZu6BpluF5tICe+9O4757l9Ltn0D84i95+P6KiClmWIEkSizJEUeQFiFwuB8uyeLsUfxc00TvgR3vnM3R2P8f1Lh/nUttjtN6YRERQmRKZq6EiVNRZk1L6EEHF7ZZZwe67T3H5ykN03h7Htc5RzvmmflxsdrPBi+yAyA2hg6SQTCBobqVSCcVicYvCsgXP3CbuelPo8xFpHnumMujxZqHoWSSTthG6rnMzyATKHaMo5vN5u2CxZKH+poT9DQLqrkZwoj2C2rYIDl4QsfesitVAAuHgOrsm9hVxrg6ZQAopEn+uTaWCGXUTk2IKU+ImvGKasYkn0TQmpTQSuQLyOXaIzahQKPBZUaRWnXYJWldYLVeBtXzotY4dMyr2LWiomdewZz6Gba90bPcbWIqlEFcVKL9NIEMcUwgnp7a5QotVDWQMrKRULG7IWEzKPK6mVaxlNOTLJf5ihahU8L+HKzQrFnqEaTR/H0Pjpwdo+urGhY8DaAmMo0P2IpxUoMkMpk7dojKmxbhJzl7VZYsVlHJs8FkdkUK8Sjinsz3mnlmEyeZTKNrzIzcp/r0mdVxhmTXTK/nRvD6Bhg9M4edBNH5xo0V4gg7Fh2CCtb+4zC54hCsiNfQbprWjkHLncRWYQz9SMlaSApaMUJWVnwJWUyKSGeZ0PMGvBimgPxNqz8HJHcddw8PDGB8Zw4RnDI9Gx6tQPsH2PSMj8Hg8GGHvDQ0N/ROqQ/wCi6hkF9V0PeoAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;OSI Layer&quot;
        title=&quot;OSI Layer&quot;
        src=&quot;/static/bc5c6ba3c4c105e14ca8d3481f030e3c/fcda8/1.png&quot;
        srcset=&quot;/static/bc5c6ba3c4c105e14ca8d3481f030e3c/12f09/1.png 148w,
/static/bc5c6ba3c4c105e14ca8d3481f030e3c/e4a3f/1.png 295w,
/static/bc5c6ba3c4c105e14ca8d3481f030e3c/fcda8/1.png 590w,
/static/bc5c6ba3c4c105e14ca8d3481f030e3c/efc66/1.png 885w,
/static/bc5c6ba3c4c105e14ca8d3481f030e3c/f43e4/1.png 1120w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Of these, the Transport Layer is Layer 4. Its main purpose is to provide a reliable data transfer service to the layers above it. Put simply, when two computers communicate, the transport layer is what verifies that the data arrived as intended. The two representative transport layer protocols are TCP and UDP.&lt;/p&gt;
&lt;h3 id=&quot;--tcp&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#--tcp&quot; aria-label=&quot;  tcp permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;- TCP&lt;/h3&gt;
&lt;p&gt;TCP stands for Transmission Control Protocol and is a connection-oriented protocol. Before any data is transferred, the client and server must establish a connection, a process known as the 3-Way Handshake. Just as a handshake seals an agreement between two people, in networking the term refers to the sequence in which both sides set the parameters of the communication channel before transmission begins.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/ed84d03b4793e568de927d0694172347/09b15/2.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 93.24324324324323%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAATCAYAAACQjC21AAAACXBIWXMAABYlAAAWJQFJUiTwAAACP0lEQVQ4y31Ui5KiMBD0/7/u6u72XMUH6qr4WOSREBIgc9OBIKJuqqZAMumZ6e44IV51XVOapiSlxE+y1r4Mvzd8+rOCz+LbBB+N0bRcLuhwPNJPqwWmhwJaa1osAtofDndALGVqMlVDTWOpHsXlciEhcg/7VKjNa9z7JM9zN2quaioBWjcO2EdVW9rs9tz9ifcsaf42jIa7Urp2eQ4w3IS0Xq/okkiXgLaRhE6xVuuQprM5/fr9l06XmDjFTVN2UTDYTRgG7DqsmFTL7SZSUyoNybIiobrgd6kMHU9n+vgMeM8wSNMD4okmkNsDeh6wMeQNHXrihRAURRGLZ55c4Ph3I48As6Ktjg0EuEQRqTQdojP9+ZhSmgk3su460x3P+bBDyK6UYsDKjQH+fJfaVPRv+ulGRi+aiR/y18YIMIqOFLCPvjPlCK68yt0T2uQ8snhj+iFdDhD+gSgYubH2paELBisK9XBDhu9mCOi3oaj/aEfAFYvRdMb1xn7boT+aM4cYGZvgBVWxd/2OaR4saLlas7kj9208CejpAVHZWia2vEuPA62Sls7XG3twTnGSUSYKKrkQCsuyDW9s40XZ7jZOlPNNugQzsANWsFiy0jNahTtK86JVmkH64MJQue8wy7hyltIpFnxb7jcFlXF3r3FCX/sjzeYBFap0qkPIZmD+Bw574u1dLQRGgM+QlrChV+GWO6oGfxptIBdN9D4Eh7hW4fbrSd1XinuOPWAzBsSBnMc+XeMH29zjtZnfqfwf9oTPkCBylHQAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;TCP&quot;
        title=&quot;TCP&quot;
        src=&quot;/static/ed84d03b4793e568de927d0694172347/fcda8/2.png&quot;
        srcset=&quot;/static/ed84d03b4793e568de927d0694172347/12f09/2.png 148w,
/static/ed84d03b4793e568de927d0694172347/e4a3f/2.png 295w,
/static/ed84d03b4793e568de927d0694172347/fcda8/2.png 590w,
/static/ed84d03b4793e568de927d0694172347/efc66/2.png 885w,
/static/ed84d03b4793e568de927d0694172347/c83ae/2.png 1180w,
/static/ed84d03b4793e568de927d0694172347/09b15/2.png 1704w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The client sends a SYN segment to request a connection with the server.&lt;/li&gt;
&lt;li&gt;The server replies with an ACK acknowledging the SYN, together with its own SYN asking the client to open a port (SYN+ACK).&lt;/li&gt;
&lt;li&gt;The client receives the SYN+ACK, opens its port, and sends a final ACK to the server.&lt;/li&gt;
&lt;/ol&gt;
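The three steps above are carried out by the operating system whenever a TCP socket connects; a loopback listener makes the completed handshake observable. This Python sketch is not from the original post.

```python
import socket

# connect() below triggers the kernel's SYN -> SYN+ACK -> ACK exchange.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
srv.listen(1)
port = srv.getsockname()[1]

cli = socket.socket()
cli.connect(("127.0.0.1", port))  # 3-way handshake completes here
conn, addr = srv.accept()         # server side of the established connection
print("handshake complete, peer:", addr)

conn.close()
cli.close()
srv.close()
```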
&lt;p&gt;TCP transmits data in small units called packets (strictly, segments). When the sender transmits a packet, the receiver sends back an ACK (acknowledgment), so the sender can confirm the data arrived; if it did not, the sender retransmits until the receiver has everything. TCP also assigns a sequence number to every packet it sends, which guarantees delivery order as well. Together these mechanisms make TCP a reliable transport: nothing goes missing, and everything arrives in order.&lt;/p&gt;
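The role of sequence numbers in reordering and loss detection can be modeled in a few lines. This is an illustrative Python model, not actual TCP code; `reassemble` is an invented name.

```python
def reassemble(packets):
    # Receiver side: reorder segments by sequence number. A gap in the
    # sequence means loss, which in TCP triggers retransmission before
    # anything is delivered to the application.
    packets = sorted(packets)  # list of (seq, payload)
    seqs = [seq for seq, _ in packets]
    if seqs != list(range(len(packets))):
        return None  # gap detected: wait for retransmission
    return b"".join(payload for _, payload in packets)

print(reassemble([(1, b"B"), (0, b"A"), (2, b"C")]))  # b'ABC' (reordered)
print(reassemble([(0, b"A"), (2, b"C")]))             # None (seq 1 missing)
```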
&lt;h3 id=&quot;--udp&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#--udp&quot; aria-label=&quot;  udp permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;- UDP&lt;/h3&gt;
&lt;p&gt;UDP stands for User Datagram Protocol and is a connectionless protocol. Just as TCP transfers data in packets, UDP transfers data in units called datagrams.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 574px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/659f09b6c0610ee15f15e9a285865945/86389/3.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 69.5945945945946%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAOCAYAAAAvxDzwAAAACXBIWXMAAAsSAAALEgHS3X78AAABzElEQVQ4y41TXY+bMBDM//9ffehDe305VWqlNicCJAZjEwjfZLqzx0boPqSuRNYEe3dmdnzALkI3oupnXS/LgmmaNM/zjNvthmEY0Lat5o/ifr/jwJ+yLJFnGX7kAV/SiDpGlN7DVxW8ZH7P5PvlcsHFOX2v61oLd12HNE21OeMgdRFZQDZhXdgH67rq5nEc9YChIkoCMAZc932Poig0bwV3kHfZDlqQNv8jEq4/pUw0RBhCUESjICmqiG9/TnhxHrkPiPItz3OlzEw21+tVUfOpRBrWeSCkTtSINF7pjkjKgJ8u4m8R0NQRTdOiliLLNig2N8pJkqg07yj/T7Tj9CldLWi6sKNZhOu+F4vIENiZTyUsnk8Ov08ZQgwPLc1eFlrQiRUI2wpTH7MGtXUyxehLPCUOT1mFsWt1D/dz8sfj8eFNpWx+2k+WWhIBD00bEryZ/CvXVc8+fPjWHvbObEPi2nzJwZjnhmnGqRne24b0zufzpl2vNuLkSZeWoF0yuQ2p7Pn660X1vIocsb3hu9yuJMuxbN5UytSKRc39zHZ3mZum0UakZojnedkcP6GQGTxuykeUKbhzhWYvw3CCkGsi875SAKbZ/jjP/gMokEDzsK2gEgAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;UDP&quot;
        title=&quot;UDP&quot;
        src=&quot;/static/659f09b6c0610ee15f15e9a285865945/86389/3.png&quot;
        srcset=&quot;/static/659f09b6c0610ee15f15e9a285865945/12f09/3.png 148w,
/static/659f09b6c0610ee15f15e9a285865945/e4a3f/3.png 295w,
/static/659f09b6c0610ee15f15e9a285865945/86389/3.png 574w&quot;
        sizes=&quot;(max-width: 574px) 100vw, 574px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Whereas TCP establishes a connection before transferring data, UDP skips the connection step entirely: the sender simply transmits. With no connection setup, UDP can move data faster than TCP, but delivery reliability suffers. Moreover, because no connection is established, each datagram is routed independently and may take a different path. Some paths are fast and others slow, so a datagram sent first is not guaranteed to arrive first; UDP therefore cannot guarantee delivery order.&lt;/p&gt;
&lt;h2 id=&quot;1-어떤-프로토콜을-이용하는-것이-좋은가&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#1-%EC%96%B4%EB%96%A4-%ED%94%84%EB%A1%9C%ED%86%A0%EC%BD%9C%EC%9D%84-%EC%9D%B4%EC%9A%A9%ED%95%98%EB%8A%94-%EA%B2%83%EC%9D%B4-%EC%A2%8B%EC%9D%80%EA%B0%80&quot; aria-label=&quot;1 어떤 프로토콜을 이용하는 것이 좋은가 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;1. Which Protocol Should You Use?&lt;/h2&gt;
&lt;p&gt;So which protocol should you use?&lt;/p&gt;
&lt;p&gt;As you might expect, TCP and UDP each have distinct strengths, so neither is the single right answer. The best approach is to consider which protocol fits the service you are building, and apply it accordingly.&lt;/p&gt;
&lt;p&gt;Take file transfer as an example. From the user's point of view, faster is of course better. But nobody wants to receive a file whose contents arrive scrambled or with pieces missing due to transfer failures; no matter how fast the transfer, a file that cannot even be opened is useless. In this case, even at some cost in speed, the integrity of the delivered file must be guaranteed, so TCP is the better fit.&lt;/p&gt;
&lt;p&gt;UDP, which cannot guarantee reliable delivery, is better suited to services where speed matters most, such as streaming, than to reliability-critical services like file transfer.&lt;/p&gt;
&lt;h2 id=&quot;2-성능-향상&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#2-%EC%84%B1%EB%8A%A5-%ED%96%A5%EC%83%81&quot; aria-label=&quot;2 성능 향상 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;2. Improving Performance?&lt;/h2&gt;
&lt;p&gt;Both TCP and UDP were designed and deployed a long time ago, and there have been many attempts to improve their performance. In particular, much effort went into compensating for TCP's main weakness, its slowness, but the 3-Way Handshake blamed for it could never be fundamentally eliminated.&lt;/p&gt;
&lt;p&gt;TCP also suffers from a problem called HOL Blocking (Head-of-Line Blocking). To reduce transfer latency in HTTP over TCP, pipelining lets a client send multiple requests at once instead of waiting for each ACK, which does cut some of the delay. However, if the first request takes a long time to process, the responses to every request sent after it are held up behind it; this is HOL Blocking. Pipelining can shave some latency, but as long as HTTP runs on TCP, the HOL Blocking problem remains.&lt;/p&gt;
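The in-order delivery constraint behind HOL Blocking can be modeled in a few lines. This is an illustrative sketch only; `pipelined_finish` and the ready times are invented for the example.

```python
def pipelined_finish(ready_times):
    # In-order delivery: response i cannot complete before response i-1,
    # even if its own processing finished earlier.
    finish, prev = [], 0
    for r in ready_times:
        prev = max(r, prev)  # held back by everything ahead of it
        finish.append(prev)
    return finish

# Three pipelined requests: the first takes 5 units, the others 1 and 2.
print(pipelined_finish([5, 1, 2]))  # [5, 5, 5]: the slow head blocks the rest
```

With independent streams (as in QUIC), the second and third responses would have completed at times 1 and 2, which is exactly the gain the next section is about.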
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/8bc92926da708d6d19109c9ed12589d6/8c857/4.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 76.35135135135135%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAPCAIAAABr+ngCAAAACXBIWXMAAAsSAAALEgHS3X78AAABwElEQVQoz42TWZLbIBCGfe7kFDNnmBPkKdeIX1yObcmDJCTEAELseyS54iw1nqSLoijor/lpuncqqYM52GKpovvLXkdd/jQAgBBiWeSc/zra+exJIKlkPfGmucp58kpzxkaE0uY9DIO1trxnu7LFXCbLRAe6NX4MVswcY8lYibEF4D/gSYBrM6skVYTDyCZeUoxavZ5PkvOS00357+J/wYbNsGljyt4nQjVlZhbRhdL1iA6oh0PfDxiPSsmH8P1g2QoxWVdOl66ph3FUEFn8pkY881lbY2KMD+Bt5yawudaoJ0rHkcYW8qoeOogZoyGEFU5pfY8hvNvgHKNXymstKYHX+ng4TrPVNhEWhHQDwggtMFtla62Px6NZEjqJG7w8+/r5U/f81CI8YgmaXmkT05L4vFzjnN2GW2GEUFVVb4TcYc0ofHkBX77OriQXmrqym+s/vqpvuxVWnrvkUjKMBaO7rnv4z6sS63wIApPXGkwiKZ2CVLgBsF2F9H3/EYwJWQY4V6fvFyWMogRUFaE0xLh4QAgfwjmEtNW9pRzWtWXUe++8v3fChzcLYQ6HrJTj/Pxtn3/63StxSaeU8t2u+gHs3V8XMQf6TQAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;HOL&quot;
        title=&quot;HOL&quot;
        src=&quot;/static/8bc92926da708d6d19109c9ed12589d6/fcda8/4.png&quot;
        srcset=&quot;/static/8bc92926da708d6d19109c9ed12589d6/12f09/4.png 148w,
/static/8bc92926da708d6d19109c9ed12589d6/e4a3f/4.png 295w,
/static/8bc92926da708d6d19109c9ed12589d6/fcda8/4.png 590w,
/static/8bc92926da708d6d19109c9ed12589d6/8c857/4.png 761w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Fundamentally solving the 3-Way Handshake and HOL Blocking problems effectively meant giving up TCP, which is why Google turned to UDP.&lt;/p&gt;
&lt;h2 id=&quot;3-quic의-탄생&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#3-quic%EC%9D%98-%ED%83%84%EC%83%9D&quot; aria-label=&quot;3 quic의 탄생 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;3. The Birth of QUIC&lt;/h2&gt;
&lt;p&gt;QUIC is a transport layer protocol built on top of UDP, first designed by Jim Roskind at Google.&lt;/p&gt;
&lt;p&gt;QUIC may sound unfamiliar, but Google already uses it across many of its services.&lt;/p&gt;
&lt;p&gt;Navigating to &lt;code class=&quot;language-text&quot;&gt;chrome://flags/&lt;/code&gt; in the address bar exposes a variety of experimental features. Enabling the &quot;Experimental QUIC protocol&quot; entry allows transfers over the QUIC protocol: if a site you visit supports QUIC, the browser is permitted to use it. Chrome presumably still lists this as experimental because the protocol is not fully settled yet.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/424865b2bdd6ed55e7ed502df8ae5f38/df438/5.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 52.02702702702703%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAKCAYAAAC0VX7mAAAACXBIWXMAABYlAAAWJQFJUiTwAAABvklEQVQoz2VSy27bMBD0jyQtLNF6URT1jh9y4liWrMhuYgO9FQWKFkX7Bb23p371dJeyjBQ5jGYpkaPZWU5cT8KyBIRwMJ3auLl5h9vb92/A76e0z/MVIfwPcV4hzpbwSGviuAHitECSlYZd2iBmHoTjX+AN65mLGa3ZwBV0ljmMMiidme8TfnRNh6ZusN/tkScZfNoYBiGkLyEDBaliqCgxCKmWob6yTnKkZEbpBLZwySEJfmha9CTYb3d4bjsc6AfMp67H6fiC55czuqcj2u6APaFpn67MaPc98mIOy3YGh9VDjeV6Q3gcuHrA6n6oV+stcU0uU9PemNvYNtc+dRHIaHDI+UQ6hbq0oKjWSWFYSmWEVLpEqMtrZqPIa2GGEeSHS5Nz/QiOFyKIcmTLxgj4gYYv48thaZih4wzlfIUk5T0RAjLCMIIsVi7ukd9VKOZrFMQ8tYCcSWIDlZmWRsEBdBusKYRtwSaeCTEIsgMOu6YJP24brCm7NKcrRNPjqTInaW5cjdeLaxWXOH/7g9PX3/j48y8On36NGbq4W1Q0pYW5gzyk1xm9hTJ5cbb16Ts2xy/YnX9g0382gv8AYFE9DOnTGhsAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;QUIC Enable&quot;
        title=&quot;QUIC Enable&quot;
        src=&quot;/static/424865b2bdd6ed55e7ed502df8ae5f38/fcda8/5.png&quot;
        srcset=&quot;/static/424865b2bdd6ed55e7ed502df8ae5f38/12f09/5.png 148w,
/static/424865b2bdd6ed55e7ed502df8ae5f38/e4a3f/5.png 295w,
/static/424865b2bdd6ed55e7ed502df8ae5f38/fcda8/5.png 590w,
/static/424865b2bdd6ed55e7ed502df8ae5f38/efc66/5.png 885w,
/static/424865b2bdd6ed55e7ed502df8ae5f38/c83ae/5.png 1180w,
/static/424865b2bdd6ed55e7ed502df8ae5f38/df438/5.png 1556w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Opening YouTube afterwards and checking the protocol column shows h3 (HTTP/3) in use. Since HTTP/3 is standardized on top of the QUIC protocol, this confirms it is working as expected.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/07acd5512e20e6652b5f26bb5d8026ed/6274f/6.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 60.13513513513513%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAMCAYAAABiDJ37AAAACXBIWXMAABYlAAAWJQFJUiTwAAACLklEQVQoz11TW3LbMAz0SRpbIkU9KFFv6mnZjp2kTfvVv97/HtsllWSm/cAAIoHFYkEdVJyiqhqM/YCi6mGqFk03oGkt0ixHXlSw4+LP2o/zkjmFqVHwrm87dE3l47rpcZAqQZLoPbGxPtEBjNMZy3rFdr1jcIC8323wgF/fbJLpEkma0xscno4hTmEEoVKoOEMgFCLneebik7Mg8t8uV8h491GCb08BQhHjeAohWX8KJA7uI2SRJjMZxR5MSOU7plkBw7GzbI91XjKHEzGOOZU/46hxqn0jR4iAwiP3dsJ6vmBaVkqQMNHA1hX+aIHV5OjHGV3XEzSHKUvq1bBmQFU3UCr2Nc4IKBEEAZapxe3+StCNS6qRpimFNsi8L6hb7Rt+//GTmk7U2qA0JfK8gBAhoiiClHJnGAQhBttgmldustu7EcglO+9YlVXlga7Pd7TcrNZk6gDpHZBSyoPuI1PHZRmwXW6+0F1qrX1BlmWejWt03sjw/RdHtbwzaJoWhlNIIf4FPFLD2o6Y1w3DtHgr+TbdMpI0817rgvqeyfBBcEvWhZ/AMRUyQkQdZaQ+AGnDdMOwPiDiEiepEWsCGguVUc+ig13umLYXvLz/Rt1zcXmLsp2Z0+MoUgQqRxDpzy2H6PsWXW/5OLWn7/yXhozdUma+gIVTOFbuvqI8OUcW/2vo3uJIwR8vb7jcnvF4feOfMu9bpoY5t+zi6+3uF+MK7TBSy5FPqPZb/gT8CzIdcugTavDkAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;youtube&quot;
        title=&quot;youtube&quot;
        src=&quot;/static/07acd5512e20e6652b5f26bb5d8026ed/fcda8/6.png&quot;
        srcset=&quot;/static/07acd5512e20e6652b5f26bb5d8026ed/12f09/6.png 148w,
/static/07acd5512e20e6652b5f26bb5d8026ed/e4a3f/6.png 295w,
/static/07acd5512e20e6652b5f26bb5d8026ed/fcda8/6.png 590w,
/static/07acd5512e20e6652b5f26bb5d8026ed/efc66/6.png 885w,
/static/07acd5512e20e6652b5f26bb5d8026ed/c83ae/6.png 1180w,
/static/07acd5512e20e6652b5f26bb5d8026ed/6274f/6.png 2098w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;HTTP has so far been built on TCP, and because the chronic TCP problems described above could not be fixed, Google proposed the QUIC protocol, which went on to be adopted as the basis of the HTTP/3 standard.&lt;/p&gt;
&lt;p&gt;That covers transport layer protocols and the background behind QUIC. In the next post, we will look at how QUIC, running over UDP, could serve as the foundation of HTTP, a protocol that requires reliable delivery, and what the distinguishing features of QUIC are.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Building a KakaoTalk Bot (1) - How It Works, Prerequisites]]></title><description><![CDATA[KakaoTalk Bot To cut out the tedious step of searching the web for game character information and to provide a few simple convenience features, I decided to build a KakaoTalk bot for personal use. Since the bot first compiled (March 7)…]]></description><link>https://mintyu.github.io/KakaoBot_1/</link><guid isPermaLink="false">https://mintyu.github.io/KakaoBot_1/</guid><pubDate>Tue, 22 Mar 2022 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;카카오톡-봇&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EC%B9%B4%EC%B9%B4%EC%98%A4%ED%86%A1-%EB%B4%87&quot; aria-label=&quot;카카오톡 봇 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;KakaoTalk Bot&lt;/h2&gt;
&lt;p&gt;To cut out the tedious step of searching the web whenever I wanted to check game character information, and to provide a few simple convenience features, I decided to build a KakaoTalk bot for personal use.&lt;/p&gt;
&lt;p&gt;Two weeks after the bot first compiled (March 7), it has stabilized reasonably well, and since several people have asked about the development process, I am going to walk through it briefly.&lt;/p&gt;
&lt;h3 id=&quot;작동-원리&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EC%9E%91%EB%8F%99-%EC%9B%90%EB%A6%AC&quot; aria-label=&quot;작동 원리 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;How It Works&lt;/h3&gt;
&lt;p&gt;If you have ever seen a third-party bot (other than the room-admin bot) in a KakaoTalk group chat or open chat, it was almost certainly (99.9%) built on the mechanism described here.&lt;/p&gt;
&lt;p&gt;At one time there were bots that impersonated the KakaoTalk client by speaking its wire protocol, the &lt;code class=&quot;language-text&quot;&gt;LOCO Protocol&lt;/code&gt; (released October 13, 2011, replacing the earlier HTTP/TCP-based protocol with an in-house design). That approach is both a security risk and a violation of the terms of service, so using such bots can get your account suspended.&lt;/p&gt;
&lt;p&gt;For this reason, most bots today run through a KakaoTalk bot app instead.&lt;/p&gt;
&lt;p&gt;The apps most commonly used to run bots are &lt;code class=&quot;language-text&quot;&gt;메신저봇 R&lt;/code&gt; (Messenger Bot R) and &lt;code class=&quot;language-text&quot;&gt;채팅 자동응답 봇&lt;/code&gt; (Auto-Reply Chat Bot). I used &lt;code class=&quot;language-text&quot;&gt;메신저봇 R&lt;/code&gt;, and everything that follows assumes that app.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://play-lh.googleusercontent.com/I5QjPGUfiYmHcMqmKdIPj7AOIIADGsMcSV_eibFPUxjPUDq1R_7cDycJ7kevpi862w=s360-rw&quot; alt=&quot;메신저봇 R&quot;&gt;&lt;/p&gt;
&lt;p&gt;The app works as follows.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;It parses incoming notifications via &lt;code class=&quot;language-text&quot;&gt;NotificationListenerService&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;It hands the notification data to user-written JavaScript, executed by the Rhino JavaScript engine.&lt;/li&gt;
&lt;li&gt;It delivers the reply through &lt;code class=&quot;language-text&quot;&gt;WearableExtender&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The notification received in step 1 contains information such as the name of the room the message came from, the message body, and the sender's name.&lt;/p&gt;
&lt;p&gt;In step 2, that information is processed according to the user-written JavaScript (for example, looking the data up on the web, or formatting the result so it reads nicely).&lt;/p&gt;
&lt;p&gt;The result is then sent back to the chat room as a reply, and with that the chatbot is up and running.&lt;/p&gt;
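The parse-dispatch-reply flow can be sketched generically. This is a hypothetical Python sketch of the control flow only: the real bot script runs as JavaScript on the Rhino engine, and all names here (`handle_notification`, the `notif` fields, the `!hello` command) are illustrative, not from the original code.

```python
def handle_notification(notif, commands):
    # Step 1 hands us the room name, sender, and message from the notification;
    # step 2 dispatches on the message text; the return value is the reply
    # that step 3 would send back (None means: do not reply).
    room, sender, msg = notif["room"], notif["sender"], notif["message"]
    for prefix, handler in commands.items():
        if msg.startswith(prefix):
            return handler(msg[len(prefix):].strip(), sender)
    return None  # unrecognized message: stay silent

commands = {"!hello": lambda arg, sender: "Hi, " + sender + "!"}
reply = handle_notification(
    {"room": "dev", "sender": "minty", "message": "!hello"}, commands)
print(reply)  # Hi, minty!
```

The `room` field could be used the same way, for example to restrict the bot to specific chat rooms.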
&lt;h3 id=&quot;준비물&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EC%A4%80%EB%B9%84%EB%AC%BC&quot; aria-label=&quot;준비물 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Prerequisites&lt;/h3&gt;
&lt;p&gt;Unfortunately, KakaoTalk's current policies mean there is some preparation to do.&lt;/p&gt;
&lt;p&gt;I planned to run the bot on a spare phone (a Galaxy S9+), and since KakaoTalk does not allow one account to be logged in on two devices at once, I created a new secondary account.&lt;/p&gt;
&lt;p&gt;(If you run the bot on your own phone with your main account, it becomes hard to recover should the account be sanctioned because of spam or abuse by the bot's users.)&lt;/p&gt;
&lt;p&gt;There are broadly two ways to create a secondary account.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use your carrier's second-number service (two numbers on one line) to create another account.&lt;/li&gt;
&lt;li&gt;Use a virtual overseas phone number to create another account.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each method has trade-offs. Method 1 requires paying the carrier an extra fee, but the account is not subject to KakaoTalk's temporary restriction and can be used right away.
Method 2 costs nothing extra, since you only need a virtual overseas number from an app to pass KakaoTalk verification; however, the account will be placed under the temporary restriction imposed by KakaoTalk policy.&lt;/p&gt;
&lt;p&gt;So what is this temporary restriction?&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;1) A large number of users add the account as a KakaoTalk friend within a short period

2) The service is used in an environment similar to one in which an abusive user was active

3) KakaoTalk is used in a service environment with a history of promoting illegal gambling sites or sending obscene material

4) A large number of chat rooms or open chat rooms are created within a short period

5) The rate or frequency of joining open chat rooms is abnormally high

6) An overseas virtual number that cannot be verified with a carrier is used

7) A large number of message reports are filed against the account

8) Reports against the user in open chat rooms increase

9) KakaoTalk is used in an abnormal environment such as a PC emulator &lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This protective measure is applied when one or more of the above cases is detected: the account cannot use open chat, and a cap is placed on the number of friends it can add.&lt;/p&gt;
&lt;p&gt;Since the restriction is temporary, it is lifted once KakaoTalk's internal mechanism (presumably automated) confirms a period of normal use, but there is no way to know exactly when that will happen.&lt;/p&gt;
&lt;p&gt;Note that accounts verified with a virtual overseas number are almost always placed under the temporary restriction (case 6 above).&lt;/p&gt;
&lt;p&gt;My project was not urgent, so I signed up with an overseas number and waited for the restriction to be lifted.&lt;/p&gt;
&lt;p&gt;With a spare phone and a secondary KakaoTalk account in hand, the initial preparation is done. Starting with the next post, I will walk through actually getting the bot running.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Natural Language Processing (NLP) Practice - LSTM]]></title><description><![CDATA[LSTM (Long Short Term Memory) Practice Let's practice natural language processing with an LSTM on the movie review sentiment prediction dataset from Kaggle. The practice dataset can be downloaded via the Kaggle link. Data…]]></description><link>https://mintyu.github.io/LSTM/</link><guid isPermaLink="false">https://mintyu.github.io/LSTM/</guid><pubDate>Mon, 07 Jun 2021 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;lstmlong-short-term-memory-실습&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#lstmlong-short-term-memory-%EC%8B%A4%EC%8A%B5&quot; aria-label=&quot;lstmlong short term memory 실습 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;LSTM (Long Short Term Memory) Practice&lt;/h2&gt;
&lt;p&gt;Let's do a natural language processing exercise with an LSTM on Kaggle's movie-review sentiment prediction dataset.&lt;/p&gt;
&lt;p&gt;You can download the practice dataset via the &lt;a href=&quot;https://www.kaggle.com/c/movie-review-sentiment-analysis-kernels-only/&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;Kaggle link&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;From the Data tab, download &lt;code class=&quot;language-text&quot;&gt;train.tsv.zip&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;test.tsv.zip&lt;/code&gt; and unzip them to get the tsv files for the train set and the test set, respectively.&lt;/p&gt;
&lt;p&gt;The sentiment labels in this dataset have the following meanings.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;0 : negative&lt;/li&gt;
&lt;li&gt;1 : somewhat negative&lt;/li&gt;
&lt;li&gt;2 : neutral&lt;/li&gt;
&lt;li&gt;3 : somewhat positive&lt;/li&gt;
&lt;li&gt;4 : positive&lt;/li&gt;
&lt;/ul&gt;
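&lt;p&gt;For inspecting predictions later it can be handy to have this mapping in code; a minimal sketch (the name &lt;code class=&quot;language-text&quot;&gt;sentiment_names&lt;/code&gt; is my own, not part of the dataset):&lt;/p&gt;

```python
# Label-to-name lookup for the five sentiment classes listed above.
# The dataset itself stores only the integers 0-4.
sentiment_names = {
    0: 'negative',
    1: 'somewhat negative',
    2: 'neutral',
    3: 'somewhat positive',
    4: 'positive',
}

print(sentiment_names[4])  # positive
```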
&lt;h3 id=&quot;데이터-가져오기&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EB%8D%B0%EC%9D%B4%ED%84%B0-%EA%B0%80%EC%A0%B8%EC%98%A4%EA%B8%B0&quot; aria-label=&quot;데이터 가져오기 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Loading the Data&lt;/h3&gt;
&lt;p&gt;First, we import the libraries needed for this exercise.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; numpy &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; np
&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; pandas &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; pd
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; matplotlib &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; pyplot &lt;span class=&quot;token keyword&quot;&gt;as&lt;/span&gt; plt
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;style&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;use&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;dark_background&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; keras&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;preprocessing&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;text &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; Tokenizer
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; keras&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;preprocessing&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;sequence &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; pad_sequences
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; sklearn&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;model_selection &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; train_test_split
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; keras&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;utils&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;np_utils &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; to_categorical
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; keras&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;models &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; Sequential
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; keras&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;layers &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; Dense&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; Dropout&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; Embedding&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; LSTM&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; GlobalMaxPooling1D&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; SpatialDropout1D&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Depending on your Keras version, &lt;code class=&quot;language-text&quot;&gt;to_categorical&lt;/code&gt; is imported either as &lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; keras&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;utils&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;np_utils &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; to_categorical&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;or&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; keras&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;utils &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; to_categorical&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Proceed with whichever of the two works for you.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;df_train &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; pd&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;read_csv&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;train.tsv&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; sep&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;\t&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;train set: {0}&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;df_train&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;shape&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
df_train&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;head&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;First, we load the train dataset. Since it is a tsv file, we read it with the &lt;code class=&quot;language-text&quot;&gt;sep=&amp;#39;\t&amp;#39;&lt;/code&gt; option.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;df_test &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; pd&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;read_csv&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;test.tsv&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; sep&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;\t&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;test set: {0}&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;df_test&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;shape&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
df_test&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;head&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Likewise, we load the test dataset.&lt;/p&gt;
&lt;h3 id=&quot;데이터-전처리&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EB%8D%B0%EC%9D%B4%ED%84%B0-%EC%A0%84%EC%B2%98%EB%A6%AC&quot; aria-label=&quot;데이터 전처리 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Data Preprocessing&lt;/h3&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;replace_list &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;r&quot;i&apos;m&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;i am&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;r&quot;&apos;re&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos; are&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;r&quot;let’s&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;let us&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;r&quot;&apos;s&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;token string&quot;&gt;&apos; is&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;r&quot;&apos;ve&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos; have&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;r&quot;can&apos;t&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;can not&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;r&quot;cannot&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;can not&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;r&quot;shan’t&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;shall not&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;r&quot;n&apos;t&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos; not&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;r&quot;&apos;d&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos; would&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;r&quot;&apos;ll&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos; will&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;r&quot;&apos;scuse&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;excuse&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;&apos;,&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos; ,&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;&apos;.&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos; .&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;&apos;!&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos; !&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;&apos;?&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos; ?&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;token string&quot;&gt;&apos;\s+&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos; &apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;clean_text&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;text&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    text &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; text&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;lower&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; s &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; replace_list&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        text &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; text&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;replace&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;s&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; replace_list&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;s&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    text &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos; &apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;join&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;text&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;split&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; text&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We preprocess the text to help training. &lt;code class=&quot;language-text&quot;&gt;i&amp;#39;m&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;i am&lt;/code&gt; mean the same thing but would be treated as different tokens, so we normalize them to one form. Likewise, a word with punctuation attached would be treated as a different token, so we insert a space so that the word and the punctuation mark are tokenized separately. We also remove case differences by lowercasing every character with the &lt;code class=&quot;language-text&quot;&gt;.lower()&lt;/code&gt; function.&lt;/p&gt;
&lt;p&gt;We apply it with the code below.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;X_train &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; df_train&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Phrase&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;lambda&lt;/span&gt; p&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; clean_text&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;p&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
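&lt;p&gt;As a quick sanity check of the cleaning rules, here is a standalone sketch (the helper is repeated with only a subset of the replacement table so the snippet runs on its own):&lt;/p&gt;

```python
# Subset of the replacement table above, enough to demonstrate the idea:
# normalize contractions and split punctuation off as separate tokens.
replace_list = {"i'm": 'i am', "n't": ' not', '!': ' !', '.': ' .'}

def clean_text(text):
    text = text.lower()                       # remove case differences
    for s in replace_list:
        text = text.replace(s, replace_list[s])
    return ' '.join(text.split())             # collapse repeated whitespace

print(clean_text("This isn't fun."))  # this is not fun .
```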
&lt;br/&gt;
Now, let's visualize the lengths of the phrases in the corpus.
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;phrase_len &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; X_train&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;lambda&lt;/span&gt; p&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;p&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;split&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos; &apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
max_phrase_len &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; phrase_len&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;max phrase len: {0}&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;max_phrase_len&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;figure&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;figsize &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;hist&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;phrase_len&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; alpha &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0.2&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; density &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token boolean&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;xlabel&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;phrase len&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;ylabel&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;probability&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;grid&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;alpha &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0.25&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;max phrase len: 53&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/8aa599c5ed592dc8ea4ee26ea0ba2569/f6b72/1.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 78.37837837837837%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAQCAYAAAAWGF8bAAAACXBIWXMAAAsTAAALEwEAmpwYAAABe0lEQVQ4y61Ua1PCMBBM+khfSQthivIUUMbKw///79a72CCt4Aj64aabuct277KJMMag0NpFlmUu0jR1cQ1fy5VlCZEohbIeYbZ7haKEEOLu0CRKqCjG4GGM5XGHtMg7BUEQQEr5DfOX1x774G6FLjQR1nh6J0JddArCMDwRnWP+8tpjH44wkAEG4xqLwxtU3m3ZK+rjS+tTy5UpW4X7/1EYEuAZssKkN8PfKvTYKczSDEM+FD9D8aXQK+rjazmnMI4i1/Li+HeFjlDF8Ymw78O7CI3+tA370NghFDleng3/5pYjOi2n8NBgTrdl8rLpGFvIGw/FaHNSuNg3mDVb5FWJmK6kK243/WQbn2uNLVGxsYmQFc6JlHG9XqKwA8RpQirlRSNfNDa/EJZsM91uMKV2l6Rw8rym1tdY7RpMNivY6aOrqUYWKd2mjNzAmxV1kdAP8zx32BFGZBsOPzN+fbwaLuKctRZ1XUMlCR2YRETO4CdLtDeIny6/5wPcYVllo8xn+QAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;1&quot;
        title=&quot;1&quot;
        src=&quot;/static/8aa599c5ed592dc8ea4ee26ea0ba2569/fcda8/1.png&quot;
        srcset=&quot;/static/8aa599c5ed592dc8ea4ee26ea0ba2569/12f09/1.png 148w,
/static/8aa599c5ed592dc8ea4ee26ea0ba2569/e4a3f/1.png 295w,
/static/8aa599c5ed592dc8ea4ee26ea0ba2569/fcda8/1.png 590w,
/static/8aa599c5ed592dc8ea4ee26ea0ba2569/f6b72/1.png 615w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The longest phrase is 53 words long, and the distribution of phrase lengths is shown above.&lt;/p&gt;
&lt;p&gt;Every input to the neural network must have the same length, so we store this maximum length in a variable that we will use later to define the model's input.&lt;/p&gt;
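&lt;p&gt;Conceptually, what &lt;code class=&quot;language-text&quot;&gt;pad_sequences&lt;/code&gt; will do below is left-pad every token-id sequence with zeros up to that maximum length, truncating from the front when a sequence is longer. A plain-Python sketch of the idea, not the Keras implementation:&lt;/p&gt;

```python
# Mimics the pad_sequences defaults: pre-padding with 0 and pre-truncation.
def pad(seq, maxlen):
    return [0] * (maxlen - len(seq)) + seq[-maxlen:]

print(pad([4, 17, 9], 6))  # [0, 0, 0, 4, 17, 9]
```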
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;y_train = df_train[&amp;#39;Sentiment&amp;#39;]&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We take the training labels from the &lt;code class=&quot;language-text&quot;&gt;Sentiment&lt;/code&gt; column, which holds the sentiment data.&lt;/p&gt;
&lt;p&gt;Now we tokenize the text with the &lt;code class=&quot;language-text&quot;&gt;tokenizer&lt;/code&gt;,
removing special characters with the &lt;code class=&quot;language-text&quot;&gt;filters=&lt;/code&gt; option.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;max_words = 8192
tokenizer = Tokenizer(
    num_words = max_words,
    filters = &amp;#39;&amp;quot;#$%&amp;amp;()*+-/:;&amp;lt;=&amp;gt;@[\]^_`{|}~&amp;#39;
)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_train = pad_sequences(X_train, maxlen = max_phrase_len)
y_train = to_categorical(y_train)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We also one-hot encode the labels by converting &lt;code class=&quot;language-text&quot;&gt;y_train&lt;/code&gt; with the &lt;code class=&quot;language-text&quot;&gt;to_categorical&lt;/code&gt; function.&lt;/p&gt;
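&lt;p&gt;For example, label 3 (somewhat positive) becomes a one-hot row of length 5. A hand-rolled equivalent for illustration (&lt;code class=&quot;language-text&quot;&gt;to_one_hot&lt;/code&gt; is my own name, not a Keras function):&lt;/p&gt;

```python
# One-hot encoding of a single integer label, mirroring what
# to_categorical does for each element of y_train.
def to_one_hot(label, num_classes=5):
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

print(to_one_hot(3))  # [0.0, 0.0, 0.0, 1.0, 0.0]
```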
&lt;h3 id=&quot;training&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#training&quot; aria-label=&quot;training permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Training&lt;/h3&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;model_lstm &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; Sequential&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
model_lstm&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;add&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Embedding&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;input_dim &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; max_words&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; output_dim &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;256&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; input_length &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; max_phrase_len&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
model_lstm&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;add&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;SpatialDropout1D&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0.3&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
model_lstm&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;add&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;LSTM&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;256&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; dropout &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0.3&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; recurrent_dropout &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0.3&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
model_lstm&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;add&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Dense&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;256&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; activation &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;relu&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
model_lstm&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;add&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Dropout&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0.3&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
model_lstm&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;add&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Dense&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; activation &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;softmax&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
model_lstm&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;compile&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
    loss&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;categorical_crossentropy&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    optimizer&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Adam&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    metrics&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;accuracy&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We build the model around an LSTM layer.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;history &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; model_lstm&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;fit&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;
    X_train&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    y_train&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    validation_split &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0.1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    epochs &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;
    batch_size &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;512&lt;/span&gt;
&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;epochs&lt;/code&gt; is set to 8 and &lt;code class=&quot;language-text&quot;&gt;batch_size&lt;/code&gt; to 512, with 10% of the training data held out as the validation set.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/c3e4499203aa9a9c8ee9e77f1c49ffc0/08485/2.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 41.891891891891895%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAICAYAAAD5nd/tAAAACXBIWXMAABYlAAAWJQFJUiTwAAAAu0lEQVQoz5WSzQ6EIAyEef9nFGKQg3AS5cdbN9NkDLrZxD00BTr9aAtmnmdZ11W2bZPWmvTe1c7z1P1xHFJr1T1j2FNLjzjMWGvFey8xxiu5lKJCrHPOsu+7rhnHGTTUco0cBS7LIimlm4iVoHKAIOZFuIDaUQ/7C8gqRiD9DRhC+AlEeyMQ5xwBqx5napxzCsQM3wBZ4RPIy8w0TV8tM/gEjjMkkPoLiG/DlpHIV6MYMySAfnxl5kAP+wCPiWWuG6ZLQgAAAABJRU5ErkJggg==&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;2&quot;
        title=&quot;2&quot;
        src=&quot;/static/c3e4499203aa9a9c8ee9e77f1c49ffc0/fcda8/2.png&quot;
        srcset=&quot;/static/c3e4499203aa9a9c8ee9e77f1c49ffc0/12f09/2.png 148w,
/static/c3e4499203aa9a9c8ee9e77f1c49ffc0/e4a3f/2.png 295w,
/static/c3e4499203aa9a9c8ee9e77f1c49ffc0/fcda8/2.png 590w,
/static/c3e4499203aa9a9c8ee9e77f1c49ffc0/efc66/2.png 885w,
/static/c3e4499203aa9a9c8ee9e77f1c49ffc0/c83ae/2.png 1180w,
/static/c3e4499203aa9a9c8ee9e77f1c49ffc0/08485/2.png 2014w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Training is complete. &lt;code class=&quot;language-text&quot;&gt;history&lt;/code&gt; stores the &lt;code class=&quot;language-text&quot;&gt;loss&lt;/code&gt; and &lt;code class=&quot;language-text&quot;&gt;accuracy&lt;/code&gt; values recorded at each epoch while fitting the LSTM.&lt;/p&gt;
&lt;p&gt;Visualizing the loss and accuracy data produces the following plots.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;clf&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
loss &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; history&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;history&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;loss&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
val_loss &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; history&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;history&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;val_loss&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
epochs &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;loss&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;figure&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;figsize&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;plot&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;epochs&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; loss&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;g&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; label&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Training loss&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;plot&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;epochs&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; val_loss&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;y&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; label&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Validation loss&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;title&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Training and validation loss&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;xlabel&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Epochs&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;ylabel&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Loss&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;legend&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;show&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/8602c6bc6d00b4be1354a70d420f812a/d0d8c/3.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 81.75675675675677%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAQCAYAAAAWGF8bAAAACXBIWXMAAAsTAAALEwEAmpwYAAABWElEQVQ4y6WU227DIAyGDdhADl1UTdOiVO31+v4P6PFzSLs2q7r14o/B4C82JyIiFe91GAfd7XY6DIOGEFRENMaoy7Lo4XDQeZ71eDzm/jiOyszqUxzmMUuyPvVFCZ39+5t+fO71/HXW0+mUQfiRMWa1xlBtN7/JVsRo35sUYzKcrLXadVFZuP7Fr6DfBCgzpThK1ZBaW/zOOSUEI33ytGZxLeeKRAogxiKAbudnIDLq+0E5TerGJATcBMICCEDLZksZiE/woTilrdFFj0rfBOZdxmLC6f8G2ARO05RK7ouTq14Bdl13ydC8luV9yS1L/wIQtwK348egrVD+BzDvcgjbk/gKbJ8Elusjjye7cqRWuHsAxMHGxjxdmq1QqXKX7J2tJeMxgIVwDe/a7BR3vvnXseTnkGx0WT74sssYbFBY9LEMWFv4cE7xZMHmZysFtrEYYs4MFr5vpPRJZ12wigYAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;3&quot;
        title=&quot;3&quot;
        src=&quot;/static/8602c6bc6d00b4be1354a70d420f812a/fcda8/3.png&quot;
        srcset=&quot;/static/8602c6bc6d00b4be1354a70d420f812a/12f09/3.png 148w,
/static/8602c6bc6d00b4be1354a70d420f812a/e4a3f/3.png 295w,
/static/8602c6bc6d00b4be1354a70d420f812a/fcda8/3.png 590w,
/static/8602c6bc6d00b4be1354a70d420f812a/d0d8c/3.png 609w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;clf&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
acc &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; history&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;history&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;accuracy&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
val_acc &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; history&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;history&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;val_accuracy&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;figure&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;figsize&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;plot&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;epochs&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; acc&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;g&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; label&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Training acc&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;plot&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;epochs&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; val_acc&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;y&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; label&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Validation acc&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;title&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Training and validation accuracy&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;xlabel&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Epochs&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;ylabel&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Accuracy&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;legend&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;show&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/99415847bcfe539595404742c9465c10/f6b72/4.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 80.4054054054054%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAQCAYAAAAWGF8bAAAACXBIWXMAAAsTAAALEwEAmpwYAAABTElEQVQ4y62U23LDIAxEhUHga+xeJpNkMpPX9v8/cMsCjt2maZukDztGyByvjECMMQghoO1aDMOAvu9T7JzDNE04HA7Y7/c4Ho9pvNvtsN1uEXyAqsKrhzrNY+8htrJ4fpnw9Nrj7e0dp9MpAUUE/JiY8oxxmhOTZYs0qo55Lwh1WidomibS81eoNeBCNkpXYmxyjm7FWouu664DqKosji7ELYCvSiXTDR1+68oWiBao/Kz8D6NDbsbFCzNI/q4ErKoKbdt+Li9cL+tX4DiOqOt6KdnfBzsD2Wt0yFZIIL0PdgbObZOAtuziI0D23WazWcqVf3CYd8ckd9Yyvk2qFw5H2JhoBkEd8gu3iNAQqOJQ45FxraCyj5VMc8khd1ojnTfM+kxnuVQK+5X59dwS57nSz0v9PDXzM8MziH3KfuUCzjPPG4k5wuZ1HH8AjBBKy/tCbqwAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;4&quot;
        title=&quot;4&quot;
        src=&quot;/static/99415847bcfe539595404742c9465c10/fcda8/4.png&quot;
        srcset=&quot;/static/99415847bcfe539595404742c9465c10/12f09/4.png 148w,
/static/99415847bcfe539595404742c9465c10/e4a3f/4.png 295w,
/static/99415847bcfe539595404742c9465c10/fcda8/4.png 590w,
/static/99415847bcfe539595404742c9465c10/f6b72/4.png 615w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[자연어 처리(NLP)의 전처리 - Word2Vec(워드투벡터) 실습]]></title><description><![CDATA[Word2Vec 실습(EN) 영어 데이터를 통해 Word2Vec을 학습시켜보도록 하겠습니다. 이라는 파이썬 패키지에 Word2Vec이 이미 구현되어 있으므로, 우리는 이를 따로 구현할 필요 없이 Word2Vec…]]></description><link>https://mintyu.github.io/Pytorch07/</link><guid isPermaLink="false">https://mintyu.github.io/Pytorch07/</guid><pubDate>Sat, 29 May 2021 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;word2vec-실습en&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#word2vec-%EC%8B%A4%EC%8A%B5en&quot; aria-label=&quot;word2vec 실습en permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Word2Vec 실습(EN)&lt;/h2&gt;
&lt;p&gt;Let&apos;s train Word2Vec on English data. Word2Vec is already implemented in the Python package &lt;code class=&quot;language-text&quot;&gt;gensim&lt;/code&gt;, so we can use it without implementing the algorithm ourselves.&lt;/p&gt;
&lt;p&gt;First, let&apos;s import the tools we need.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; nltk
nltk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;download&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;punkt&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; urllib&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;request
&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; zipfile
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; lxml &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; etree
&lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; re
&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; nltk&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;tokenize &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; word_tokenize&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; sent_tokenize&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id=&quot;훈련-데이터&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%ED%9B%88%EB%A0%A8-%EB%8D%B0%EC%9D%B4%ED%84%B0&quot; aria-label=&quot;훈련 데이터 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Training Data&lt;/h3&gt;
&lt;p&gt;The training data we will use is subtitle data from TED talks, provided as an XML file.&lt;/p&gt;
&lt;p&gt;You can download the archive from &lt;a href=&quot;https://wit3.fbk.eu/get.php?path=XML_releases/xml/ted_en-20160408.zip&amp;#x26;filename=ted_en-20160408.zip&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;this link&lt;/a&gt; and extract it to obtain a file named &lt;code class=&quot;language-text&quot;&gt;ted_en-20160408.xml&lt;/code&gt;, or download it automatically with Python code.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;# Download the data&lt;/span&gt;
urllib&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;request&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;urlretrieve&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;https://raw.githubusercontent.com/GaoleMeng/RNN-and-FFNN-textClassification/master/ted_en-20160408.xml&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; filename&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;ted_en-20160408.xml&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;br/&gt;
Running the code above downloads the XML file, whose contents look like this:
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;xml&quot;&gt;&lt;pre class=&quot;language-xml&quot;&gt;&lt;code class=&quot;language-xml&quot;&gt;&lt;span class=&quot;token prolog&quot;&gt;&amp;lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;&lt;/span&gt;
&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;xml&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;language&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;en&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;file&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;1&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
  &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;head&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
    &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;http://www.ted.com/talks/knut_haanaes_two_reasons_companies_fail_and_how_to_avoid_them&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;

    ...

    &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;content&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;Here are two reasons companies fail: they only do more of the same, or they only do what&apos;s new.

    ...

    So let me leave you with this. Whether you&apos;re an explorer by nature or whether you tend to exploit what you already know, don&apos;t forget: the beauty is in the balance.
    Thank you.
    (Applause)&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;content&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;file&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;file&lt;/span&gt; &lt;span class=&quot;token attr-name&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;token attr-value&quot;&gt;&lt;span class=&quot;token punctuation attr-equals&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;2&lt;span class=&quot;token punctuation&quot;&gt;&quot;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
  &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;head&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
    &lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;&lt;/span&gt;url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;http://www.ted.com/talks/lisa_nip_how_humans_could_evolve_to_survive_in_space&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;url&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
    
    ...

    (Applause)&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;content&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;file&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token tag&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;&amp;lt;/&lt;/span&gt;xml&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id=&quot;전처리&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EC%A0%84%EC%B2%98%EB%A6%AC&quot; aria-label=&quot;전처리 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Preprocessing&lt;/h3&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;targetXML&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;ted_en-20160408.xml&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;r&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; encoding&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;UTF8&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# This code assumes the xml file is in the same directory the Python script runs from!&lt;/span&gt;
target_text &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; etree&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;parse&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;targetXML&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
parse_text &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;\n&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;join&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;target_text&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;xpath&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;//content/text()&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# Extract only the text between &amp;lt;content&gt; and &amp;lt;/content&gt; from the xml file.&lt;/span&gt;

content_text &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; re&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;sub&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;r&apos;\([^)]*\)&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; parse_text&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# Use the regular-expression function re.sub to remove background-sound markers such as (Audio) and (Laughter) that appear in the content.&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# This code removes any text enclosed in parentheses.&lt;/span&gt;

sent_text &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; sent_tokenize&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;content_text&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# Perform sentence tokenization on the input corpus using NLTK.&lt;/span&gt;

normalized_text &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; string &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; sent_text&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
     tokens &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; re&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;sub&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;r&quot;[^a-z0-9]+&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; string&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;lower&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
     normalized_text&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;append&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;tokens&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# For each sentence, remove punctuation and convert uppercase letters to lowercase.&lt;/span&gt;

result &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;word_tokenize&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;sentence&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; sentence &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; normalized_text&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;token comment&quot;&gt;# Tokenize each sentence into words with NLTK.&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;After preprocessing, printing the number of samples gives the following.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;Total number of samples : {}&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;result&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;Total number of samples : 273424&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With preprocessing done, we can now train Word2Vec on the text data.&lt;/p&gt;
&lt;h3 id=&quot;word2vec-학습&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#word2vec-%ED%95%99%EC%8A%B5&quot; aria-label=&quot;word2vec 학습 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Training Word2Vec&lt;/h3&gt;
&lt;p&gt;Before training Word2Vec, install the gensim module if you do not already have it.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;$ pip install gensim&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;br/&gt;
Once the installation is complete, train a Word2Vec model on the preprocessed data with the code below.
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; gensim&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;models &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; Word2Vec&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; KeyedVectors
model &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; Word2Vec&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;sentences&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;result&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; vector_size&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; window&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; min_count&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; workers&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; sg&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The Word2Vec hyperparameters used here are as follows.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;vector_size&lt;/strong&gt; = dimensionality of the word vectors, i.e. of the embedding.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;window&lt;/strong&gt; = context window size&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;min_count&lt;/strong&gt; = minimum word frequency (words that occur less often are ignored during training).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;workers&lt;/strong&gt; = number of worker threads used for training&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;sg&lt;/strong&gt; = 0 for CBOW, 1 for Skip-gram.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Training is now complete. Word2Vec provides &lt;code class=&quot;language-text&quot;&gt;model.wv.most_similar&lt;/code&gt;, which returns the words most similar to a given input word. &lt;/p&gt;
&lt;p&gt;Which words are similar to earth?&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;model_result &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; model&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;wv&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;most_similar&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;earth&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;model_result&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;[(&amp;#39;planet&amp;#39;, 0.8294868469238281), 
(&amp;#39;mars&amp;#39;, 0.7876790165901184), 
(&amp;#39;surface&amp;#39;, 0.683158278465271), 
(&amp;#39;sun&amp;#39;, 0.6683340072631836), 
(&amp;#39;ocean&amp;#39;, 0.6607442498207092), 
(&amp;#39;moon&amp;#39;, 0.6605572700500488), 
(&amp;#39;continent&amp;#39;, 0.6459723711013794), 
(&amp;#39;universe&amp;#39;, 0.6119515895843506), 
(&amp;#39;galaxy&amp;#39;, 0.6062458753585815), 
(&amp;#39;orbit&amp;#39;, 0.6049789786338806)]&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The words most similar to earth include planet, mars, surface, and sun, which are quite plausible neighbors.&lt;/p&gt;
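&lt;p&gt;Under the hood, &lt;code class=&quot;language-text&quot;&gt;most_similar&lt;/code&gt; ranks words by the cosine similarity between their vectors. As a rough illustration of that idea (not gensim&apos;s actual implementation; the toy dictionary of vectors is hypothetical), a minimal sketch:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, vectors, topn=3):
    # Rank every other word by cosine similarity to the query word.
    query = vectors[word]
    scores = [(w, cosine_similarity(query, v))
              for w, v in vectors.items() if w != word]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:topn]
```
&lt;p&gt;gensim performs the same ranking over its internal vector matrix, just vectorized and much faster.&lt;/p&gt;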
&lt;h3 id=&quot;모델-저장-로드&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EB%AA%A8%EB%8D%B8-%EC%A0%80%EC%9E%A5-%EB%A1%9C%EB%93%9C&quot; aria-label=&quot;모델 저장 로드 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Saving and Loading the Model&lt;/h3&gt;
&lt;p&gt;A trained Word2Vec model can be saved and loaded again later.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;model&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;wv&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;save_word2vec_format&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;./eng_w2v&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;# save the model&lt;/span&gt;
loaded_model &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; KeyedVectors&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;load_word2vec_format&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;eng_w2v&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;# load the model&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;br/&gt;
&lt;p&gt;The &lt;code class=&quot;language-text&quot;&gt;loaded_model&lt;/code&gt; above can be used exactly like the &lt;code class=&quot;language-text&quot;&gt;model.wv&lt;/code&gt; variable in the earlier example.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;model_result &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; loaded_model&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;most_similar&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;earth&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;model_result&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</content:encoded></item><item><title><![CDATA[Preprocessing for NLP - Word2Vec]]></title><description><![CDATA[Word2Vec As mentioned in the post on one-hot encoding, sparse representations (Sparse Representation…]]></description><link>https://mintyu.github.io/Pytorch06/</link><guid isPermaLink="false">https://mintyu.github.io/Pytorch06/</guid><pubDate>Wed, 12 May 2021 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;word2vec워드투벡터&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#word2vec%EC%9B%8C%EB%93%9C%ED%88%AC%EB%B2%A1%ED%84%B0&quot; aria-label=&quot;word2vec워드투벡터 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Word2Vec&lt;/h2&gt;
&lt;p&gt;As mentioned in the post on &lt;a href=&quot;https://mintyu.github.io/Pytorch04/&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;one-hot encoding&lt;/a&gt;, sparse representations (Sparse Representation) such as one-hot vectors cannot express semantic similarity between words.&lt;/p&gt;
&lt;p&gt;We therefore need a way to vectorize words that does capture semantic similarity between them.&lt;/p&gt;
&lt;p&gt;This approach is called a &lt;strong&gt;distributed representation&lt;/strong&gt; (Distributed Representation), and &lt;strong&gt;Word2Vec&lt;/strong&gt; is the most representative technique for learning one. &lt;/p&gt;
&lt;h2 id=&quot;분산-표현distributed-representation&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EB%B6%84%EC%82%B0-%ED%91%9C%ED%98%84distributed-representation&quot; aria-label=&quot;분산 표현distributed representation permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Distributed Representation&lt;/h2&gt;
&lt;p&gt;Distributed representation (distributed representation) methods rest on the distributional hypothesis (distributional hypothesis): &lt;strong&gt;&apos;words that appear in similar contexts have similar meanings&apos;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Using this hypothesis, a distributed representation is learned from a corpus, spreading a word&apos;s meaning across the many dimensions of a vector.&lt;/p&gt;
&lt;p&gt;With a distributed representation, the vector dimension no longer has to equal the vocabulary size as it does for one-hot vectors, so the vectors are of much lower dimension.&lt;/p&gt;
&lt;p&gt;Distributed representations can be learned with NNLM, RNNLM, and similar models, but Word2Vec, which improves on their training speed, is the most widely used.&lt;/p&gt;
&lt;h2 id=&quot;cbowcontinuous-bag-of-words&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#cbowcontinuous-bag-of-words&quot; aria-label=&quot;cbowcontinuous bag of words permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;CBOW(Continuous Bag of Words)&lt;/h2&gt;
&lt;p&gt;Word2Vec comes in two variants: CBOW and Skip-Gram. CBOW predicts the middle word from its surrounding words, while Skip-Gram, conversely, predicts the surrounding words from the middle word. &lt;/p&gt;
&lt;p&gt;As a CBOW example, given the sentence &lt;code class=&quot;language-text&quot;&gt;&amp;quot;The fat cat sat on the mat&amp;quot;&lt;/code&gt;, the model predicts sat from &lt;code class=&quot;language-text&quot;&gt;{&amp;quot;The&amp;quot;, &amp;quot;fat&amp;quot;, &amp;quot;cat&amp;quot;, &amp;quot;on&amp;quot;, &amp;quot;the&amp;quot;, &amp;quot;mat&amp;quot;}&lt;/code&gt;. The word to be predicted, sat, is called the center word (center word), and the words used for the prediction are called context words (context word).&lt;/p&gt;
&lt;p&gt;The number of words to look at on each side of the center word is called the window (window). For example, with a window size of 2 and sat as the center word, the model uses the two preceding words fat and cat and the two following words on and the. With a window size of n, 2n context words are used to predict the center word.&lt;/p&gt;
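&lt;p&gt;The window selection described above can be sketched as a few lines of Python (a toy illustration; the function name &lt;code class=&quot;language-text&quot;&gt;cbow_pairs&lt;/code&gt; is my own, not part of any library):&lt;/p&gt;

```python
def cbow_pairs(tokens, window=2):
    # Slide over the tokens: at each position, the word there is the
    # center word, and up to `window` words on each side are the context.
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, center))
    return pairs
```
&lt;p&gt;With the example sentence and window size 2, the position of sat yields the context [&apos;fat&apos;, &apos;cat&apos;, &apos;on&apos;, &apos;the&apos;] and the center word sat.&lt;/p&gt;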
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 568px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/ae196e343f7710e1df24e60a2d78db87/10e91/1.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 68.24324324324324%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAOCAYAAAAvxDzwAAAACXBIWXMAAA7DAAAOwwHHb6hkAAACeElEQVQ4y1VTTY/TMBTMjQMX+CHcuSE4Im5ISPAX+FeshPYDhFhxQdplu+yJC2cI7Tb9TpMmqZ04TtMmwxuXVIul6XOe+8Yzz7bXLhZoTYldC5jNDrbeoWnlQ0ZelTgdfnf4ENzgeNDb4/YaZ8Mbl+Pa0eASF9OfrsbDeg1bVqgbIC52yGyDbvjJBN7Jczw8f4P7n1/BO3oK790TeMfPcO/TSzw4fw3v7IXkH+PRl7d7wjZeYakqaFvDmgKlQCmFylpkhcLHvqj403M4+X2F41/fZH6NE/8Kp34PZzJ/71/i6+DHP8LVConZYpVXSJMEaZphJTmllSi30ocajay1xQY7iVWau+igreRrbFUJk2q00iovzGvxGiE2OyGukaUJYlGttYYxBnEiDmSdCKMl5uECs8Ucg+AWwXiE4TjAIgoRLpd7hWshKUwFm5fOti6Ms0wyIssyUZ1Kq9dIxEEYhljIQY5GI8xmM4zHY0RR5OAII1GWiR3xi8i2WAtxImryPEdZlo6ELSBY1O/3HQkRBAEmkwmGw6GLjnAVx4iKLVL2SYrjskGkDPQ6c4RURpWdQqqiQiqdz+cHxcuDZRYUFspssMkNcolJblHIgXSEBNtAlbRKIhJTFSPB3EEhTzcyDZQQ1ZnCXNqgquZgOeZ/pI+07Pu+s0jQfhen0+me0DXcWVJIrbwUUbnNRNGmlTtpDpZ56iRnIe11Khn5/Z9lLXaUzuWVCKERlUpD161TyA1plaQkpHXW3O1lURRufX8PJeF2kEhLM3l+caLlbi5hOsuST9ylT50itqA7dYLfjI6QPx2aO3OOuq7d7t2dpG1eJ86Zv7vGPMdfG3ELghHFxfUAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;1&quot;
        title=&quot;1&quot;
        src=&quot;/static/ae196e343f7710e1df24e60a2d78db87/10e91/1.png&quot;
        srcset=&quot;/static/ae196e343f7710e1df24e60a2d78db87/12f09/1.png 148w,
/static/ae196e343f7710e1df24e60a2d78db87/e4a3f/1.png 295w,
/static/ae196e343f7710e1df24e60a2d78db87/10e91/1.png 568w&quot;
        sizes=&quot;(max-width: 568px) 100vw, 568px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Once the window size is fixed, the window slides over the text, changing the center word and its context words at each step to build a training data set. This technique is called a sliding window (sliding window).&lt;/p&gt;
&lt;p&gt;The left side of the figure above shows how, with a window size of 2, the sliding window changes the center and context words to build the data set. Word2Vec takes only one-hot vectors as input, and the right side shows the one-hot vectors corresponding to each choice of center and context words. Together, this is the full training data set for CBOW.&lt;/p&gt;
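&lt;p&gt;The one-hot inputs shown on the right of the figure can be built with a few lines (a toy sketch; &lt;code class=&quot;language-text&quot;&gt;one_hot&lt;/code&gt; is a hypothetical helper, not code from this post):&lt;/p&gt;

```python
def one_hot(word, vocab):
    # A one-hot vector has one dimension per vocabulary word,
    # with a 1 at the target word's index and 0 everywhere else.
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec
```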
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 559px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/ba2559ea4fd52db9a181b68ecb735baf/a65ce/2.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 62.83783783783784%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAANCAYAAACpUE5eAAAACXBIWXMAAA7DAAAOwwHHb6hkAAABpElEQVQ4y41TiW6jMBTk/z+v2lTZ0pKEchp8cJnLYfrsphQ22+w+aWQLrHnzxmOvrmu0bUtooBqNOCthrlfYWpblbv0JX/89pRQsaUVr14/gQiJOMvTD+CPpo9XDpipSWIgGUioEpwtms1d6XZVjVzuFxhjM80yHDbhscA5jRAlDmpdu33W9O2zPKClxOgU4Ho84HA7wfd/tgyD4JiyKAmXJCSUyJtASgdYdgvM7TueQSFO0enQWDAQpK4RhhN8vb4iiBHmWYRzHVa23NbXgFepWu/0wTq7BaxCSil/oqck4GQdb/TCvZ60VdlILb+tBVXdImURJPlpVXT8hyRV47mMxeudbQ5NkjDuFdjrGmINTeL0Rclmj05/yq6ohL3MaPcLz8xPZ0Dp1szFrc6ma3aW4kdM0hfWRsRysVBDkke14vsQUn9RdkP2uKVITEdrVQqgWzW3kbR53Hgq65Thl5GVNeaxwoVvW/YBHtfyRobscikq7cPuvAabZ3Ez/9wtZCTnnEEK4MWsymguFJM3vQr0leFSe1hpfsIRZISlX01+f1f/UB+Wx8gV5FT8vAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;2&quot;
        title=&quot;2&quot;
        src=&quot;/static/ba2559ea4fd52db9a181b68ecb735baf/a65ce/2.png&quot;
        srcset=&quot;/static/ba2559ea4fd52db9a181b68ecb735baf/12f09/2.png 148w,
/static/ba2559ea4fd52db9a181b68ecb735baf/e4a3f/2.png 295w,
/static/ba2559ea4fd52db9a181b68ecb735baf/a65ce/2.png 559w&quot;
        sizes=&quot;(max-width: 559px) 100vw, 559px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The figure above shows how the projection layer (Projection Layer) vector &lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;v&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;v&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.43056em;vertical-align:0em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.03588em;&quot;&gt;v&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; of size &lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;M&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;M&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.68333em;vertical-align:0em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.10903em;&quot;&gt;M&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; is computed. The input layer (Input Layer) on the left receives the one-hot vectors of the context words within the window size chosen earlier.&lt;/p&gt;
&lt;p&gt;Each of these inputs is multiplied by a weight (Weight) matrix &lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;V&lt;/mi&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mi&gt;M&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;W_{V \times M}&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.891661em;vertical-align:-0.208331em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.328331em;&quot;&gt;&lt;span style=&quot;top:-2.5500000000000003em;margin-left:-0.13889em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathdefault mtight&quot; style=&quot;margin-right:0.22222em;&quot;&gt;V&lt;/span&gt;&lt;span class=&quot;mbin mtight&quot;&gt;×&lt;/span&gt;&lt;span class=&quot;mord mathdefault mtight&quot; style=&quot;margin-right:0.10903em;&quot;&gt;M&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.208331em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; of size &lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;V&lt;/mi&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mi&gt;M&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;V \times M&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.76666em;vertical-align:-0.08333em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.22222em;&quot;&gt;V&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222222222222222em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mbin&quot;&gt;×&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222222222222222em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.68333em;vertical-align:0em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.10903em;&quot;&gt;M&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; (&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;V&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;V&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.68333em;vertical-align:0em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.22222em;&quot;&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;: the size of the one-hot input vectors, &lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;M&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;M&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.68333em;vertical-align:0em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.10903em;&quot;&gt;M&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;: the size of the projection layer vector &lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;v&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;v&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.43056em;vertical-align:0em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.03588em;&quot;&gt;v&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;), and the products are summed. Dividing that sum by 2 times the window size gives their average, the vector &lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;v&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;v&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.43056em;vertical-align:0em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.03588em;&quot;&gt;v&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;.
&lt;br/&gt;&lt;/p&gt;
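&lt;p&gt;Because multiplying a one-hot vector by W simply selects one row of W, the projection step just described reduces to averaging the context words&apos; rows. A minimal sketch (my own illustration, assuming W is given as a list of row vectors):&lt;/p&gt;

```python
def projection(context_indices, W):
    # Multiplying a one-hot vector by W selects one row of W, so the
    # projection vector v is the element-wise average of the rows
    # belonging to the 2n context words.
    rows = [W[i] for i in context_indices]
    n = len(rows)
    dim = len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(dim)]
```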
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/2011b2e190658d2d485283ba3ced26d9/6db71/3.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 33.108108108108105%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAHCAYAAAAIy204AAAACXBIWXMAAA7DAAAOwwHHb6hkAAABFklEQVQoz31RCW7DIBD0/9/oRG5kN4k5DTic9mRBqdtGapG4FnZmdrbDa8SU8TEx+JDaed93/DdySpiGAZfTCd7aI94JISClxMw4rrM6wCtwiBmlbNgIe6Ol3uvbYh9QTGAfxzbDNEEbh33b0CViKqXA2BXjVTTAClIThbJEosE4kaoFzq2IMYLLBdOVHapCTLjPkvIyuq9gygXDyIjpAes8Mt2rSiYMpODo+x7n8xlKK1hSw27vgAKrc+i890jEaqw7FOaXQk4KL+RrJatJPsT2LqTGON3w0/+Zq1Zpp7XGsizgXOLzLn952HwM4SCIqbQplGllt79VjLHUA4HHun6XHMnwqrB1Of7d5fe4MQZVlKVO1/0JHqwfLVMnk5IAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;3&quot;
        title=&quot;3&quot;
        src=&quot;/static/2011b2e190658d2d485283ba3ced26d9/fcda8/3.png&quot;
        srcset=&quot;/static/2011b2e190658d2d485283ba3ced26d9/12f09/3.png 148w,
/static/2011b2e190658d2d485283ba3ced26d9/e4a3f/3.png 295w,
/static/2011b2e190658d2d485283ba3ced26d9/fcda8/3.png 590w,
/static/2011b2e190658d2d485283ba3ced26d9/6db71/3.png 659w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The vector obtained this way is then multiplied by a second weight matrix &lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msubsup&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;M&lt;/mi&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mi&gt;V&lt;/mi&gt;&lt;/mrow&gt;&lt;mo mathvariant=&quot;normal&quot;&gt;′&lt;/mo&gt;&lt;/msubsup&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;W^ \prime _{M \times V}&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1.0855540000000001em;vertical-align:-0.333662em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.751892em;&quot;&gt;&lt;span style=&quot;top:-2.424669em;margin-left:-0.13889em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathdefault mtight&quot; style=&quot;margin-right:0.10903em;&quot;&gt;M&lt;/span&gt;&lt;span class=&quot;mbin mtight&quot;&gt;×&lt;/span&gt;&lt;span class=&quot;mord mathdefault mtight&quot; style=&quot;margin-right:0.22222em;&quot;&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3.063em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;′&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.333662em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;. The result has the same size as the one-hot vectors received at the input layer. &lt;/p&gt;
&lt;p&gt;CBOW then applies the softmax (softmax) function to this vector. Softmax maps each entry to a real number between 0 and 1 such that the entries sum to 1; the resulting vector is called the score vector (score vector). Each entry of the score vector has the following meaning.&lt;/p&gt;
&lt;p&gt;The value between 0 and 1 at the j-th index of the score vector is the probability that the j-th word is the center word. The score vector should therefore get close to the center word&apos;s one-hot vector, whose value we actually know.&lt;/p&gt;
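&lt;p&gt;The softmax step can be sketched as follows (a minimal, numerically stabilized version for illustration; frameworks provide their own implementations):&lt;/p&gt;

```python
import math

def softmax(z):
    # Subtract the max for numerical stability, exponentiate,
    # then normalize so the entries sum to 1.
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]
```
&lt;p&gt;The output preserves the ordering of the inputs, so the largest score corresponds to the word the model considers most likely to be the center word.&lt;/p&gt;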
&lt;p&gt;Writing the score vector as &lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mover accent=&quot;true&quot;&gt;&lt;mi&gt;y&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;\hat{y}&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.8888799999999999em;vertical-align:-0.19444em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord accent&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.69444em;&quot;&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.03588em;&quot;&gt;y&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;accent-body&quot; style=&quot;left:-0.19444em;&quot;&gt;&lt;span class=&quot;mord&quot;&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.19444em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; and the center word&apos;s one-hot vector as &lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;y&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;y&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.625em;vertical-align:-0.19444em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.03588em;&quot;&gt;y&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;, CBOW uses cross-entropy as its loss function (loss function) to reduce the error between these two vectors.&lt;/p&gt;
&lt;span class=&quot;katex-display&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;H&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mover accent=&quot;true&quot;&gt;&lt;mi&gt;y&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;mi&gt;y&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mo&gt;−&lt;/mo&gt;&lt;munderover&gt;&lt;mo&gt;∑&lt;/mo&gt;&lt;mrow&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;mo fence=&quot;true&quot;&gt;∣&lt;/mo&gt;&lt;mi&gt;V&lt;/mi&gt;&lt;mo fence=&quot;true&quot;&gt;∣&lt;/mo&gt;&lt;/mrow&gt;&lt;/munderover&gt;&lt;msub&gt;&lt;mi&gt;y&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/msub&gt;&lt;mi&gt;log&lt;/mi&gt;&lt;mo&gt;⁡&lt;/mo&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mover accent=&quot;true&quot;&gt;&lt;msub&gt;&lt;mi&gt;y&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;H(\hat{y}, y) = - \sum_{j=1}^{\left\vert V \right\vert} y_j \log(\hat{y_i})&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.08125em;&quot;&gt;H&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord accent&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.69444em;&quot;&gt;&lt;span 
style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.03588em;&quot;&gt;y&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;accent-body&quot; style=&quot;left:-0.19444em;&quot;&gt;&lt;span class=&quot;mord&quot;&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.19444em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.16666666666666666em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.03588em;&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2777777777777778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2777777777777778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:3.3747820000000006em;vertical-align:-1.4137769999999998em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;−&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.16666666666666666em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mop op-limits&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:1.9610050000000006em;&quot;&gt;&lt;span style=&quot;top:-1.872331em;margin-left:0em;&quot;&gt;&lt;span 
class=&quot;pstrut&quot; style=&quot;height:3.05em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;mord mathdefault mtight&quot; style=&quot;margin-right:0.05724em;&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;mrel mtight&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mord mtight&quot;&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3.050005em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3.05em;&quot;&gt;&lt;/span&gt;&lt;span&gt;&lt;span class=&quot;mop op-symbol large-op&quot;&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-4.386005em;margin-left:0em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3.05em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;&lt;span class=&quot;minner mtight&quot;&gt;&lt;span class=&quot;mopen mtight delimcenter&quot; style=&quot;top:0em;&quot;&gt;&lt;span class=&quot;mtight&quot;&gt;∣&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mord mathdefault mtight&quot; style=&quot;margin-right:0.22222em;&quot;&gt;V&lt;/span&gt;&lt;span class=&quot;mclose mtight delimcenter&quot; style=&quot;top:0em;&quot;&gt;&lt;span class=&quot;mtight&quot;&gt;∣&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:1.4137769999999998em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.16666666666666666em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.03588em;&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span 
class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.311664em;&quot;&gt;&lt;span style=&quot;top:-2.5500000000000003em;margin-left:-0.03588em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mathdefault mtight&quot; style=&quot;margin-right:0.05724em;&quot;&gt;j&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.286108em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.16666666666666666em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mop&quot;&gt;lo&lt;span style=&quot;margin-right:0.01389em;&quot;&gt;g&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord accent&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.69444em;&quot;&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.03588em;&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.31166399999999994em;&quot;&gt;&lt;span style=&quot;top:-2.5500000000000003em;margin-left:-0.03588em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 
size3 mtight&quot;&gt;&lt;span class=&quot;mord mathdefault mtight&quot;&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;accent-body&quot; style=&quot;left:-0.25em;&quot;&gt;&lt;span class=&quot;mord&quot;&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.19444em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;The equation above is the cross-entropy function with the actual center word&apos;s one-hot vector and the score vector as its inputs.&lt;/p&gt;
&lt;span class=&quot;katex-display&quot;&gt;&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;H&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mover accent=&quot;true&quot;&gt;&lt;mi&gt;y&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mo separator=&quot;true&quot;&gt;,&lt;/mo&gt;&lt;mi&gt;y&lt;/mi&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mo&gt;−&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;y&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;mi&gt;log&lt;/mi&gt;&lt;mo&gt;⁡&lt;/mo&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mover accent=&quot;true&quot;&gt;&lt;msub&gt;&lt;mi&gt;y&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;H(\hat{y}, y) = - y_i \log(\hat{y_i})&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.08125em;&quot;&gt;H&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord accent&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.69444em;&quot;&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.03588em;&quot;&gt;y&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; 
style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;accent-body&quot; style=&quot;left:-0.19444em;&quot;&gt;&lt;span class=&quot;mord&quot;&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.19444em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mpunct&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.16666666666666666em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.03588em;&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2777777777777778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2777777777777778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;−&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.03588em;&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.31166399999999994em;&quot;&gt;&lt;span style=&quot;top:-2.5500000000000003em;margin-left:-0.03588em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mathdefault mtight&quot;&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span 
class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.16666666666666666em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mop&quot;&gt;lo&lt;span style=&quot;margin-right:0.01389em;&quot;&gt;g&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord accent&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.69444em;&quot;&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.03588em;&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.31166399999999994em;&quot;&gt;&lt;span style=&quot;top:-2.5500000000000003em;margin-left:-0.03588em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mathdefault mtight&quot;&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;accent-body&quot; style=&quot;left:-0.25em;&quot;&gt;&lt;span 
class=&quot;mord&quot;&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.19444em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;p&gt;Given that the center word is a one-hot vector, the equation can also be written as above. (&lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;i&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.65952em;vertical-align:0em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathdefault&quot;&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; is the index of the nonzero element in the one-hot vector.)&lt;/p&gt;
&lt;p&gt;If the prediction is exactly right, &lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mover accent=&quot;true&quot;&gt;&lt;msub&gt;&lt;mi&gt;y&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;\hat{y_i} = 1&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.8888799999999999em;vertical-align:-0.19444em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord accent&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.69444em;&quot;&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.03588em;&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t vlist-t2&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.31166399999999994em;&quot;&gt;&lt;span style=&quot;top:-2.5500000000000003em;margin-left:-0.03588em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mathdefault mtight&quot;&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.15em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;top:-3em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:3em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;accent-body&quot; style=&quot;left:-0.25em;&quot;&gt;&lt;span class=&quot;mord&quot;&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-s&quot;&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.19444em;&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2777777777777778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2777777777777778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.64444em;vertical-align:0em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;, and the value of the cross-entropy becomes &lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mo&gt;−&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mo&gt;×&lt;/mo&gt;&lt;mi&gt;log&lt;/mi&gt;&lt;mo&gt;⁡&lt;/mo&gt;&lt;mo stretchy=&quot;false&quot;&gt;(&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mo stretchy=&quot;false&quot;&gt;)&lt;/mo&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mn&gt;0&lt;/mn&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;-1 \times \log(1) = 0&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.72777em;vertical-align:-0.08333em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;−&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222222222222222em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mbin&quot;&gt;×&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2222222222222222em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:1em;vertical-align:-0.25em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mop&quot;&gt;lo&lt;span style=&quot;margin-right:0.01389em;&quot;&gt;g&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;mopen&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;mclose&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2777777777777778em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mrel&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mspace&quot; style=&quot;margin-right:0.2777777777777778em;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.64444em;vertical-align:0em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;. The model should therefore be trained in the direction that minimizes this expression.&lt;/p&gt;
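&lt;p&gt;As a rough illustration (not from the original post; the probabilities are made up), the cross-entropy of a softmax score vector against a one-hot target reduces to the negative log of the probability assigned to the true center word:&lt;/p&gt;

```python
import numpy as np

# Hypothetical 5-word vocabulary; index 2 is the true center word.
y = np.array([0.0, 0.0, 1.0, 0.0, 0.0])      # one-hot target vector
y_hat = np.array([0.1, 0.1, 0.6, 0.1, 0.1])  # softmax score vector (made up)

# Full cross-entropy sum; only the nonzero index of y contributes.
loss = -np.sum(y * np.log(y_hat))
print(loss)               # 0.5108..., i.e. -log(0.6)
print(-np.log(y_hat[2]))  # identical: the simplified one-hot form
```

&lt;p&gt;A perfect prediction would put probability 1 on index 2, giving a loss of -log(1) = 0, which is exactly the minimum the training aims for.&lt;/p&gt;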
&lt;p&gt;Now, performing back propagation trains &lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;W&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.68333em;vertical-align:0em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; and &lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;mo mathvariant=&quot;normal&quot;&gt;′&lt;/mo&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;W^ \prime&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.751892em;vertical-align:0em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.751892em;&quot;&gt;&lt;span style=&quot;top:-3.063em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;′&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;. Once training is complete, you decide whether to use the &lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;M&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;M&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.68333em;vertical-align:0em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.10903em;&quot;&gt;M&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;-dimensional rows of &lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;W&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.68333em;vertical-align:0em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; or the columns of &lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;mo mathvariant=&quot;normal&quot;&gt;′&lt;/mo&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;W^ \prime&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.751892em;vertical-align:0em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.751892em;&quot;&gt;&lt;span style=&quot;top:-3.063em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;′&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; as the embedding vectors. Sometimes the average of &lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;W&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.68333em;vertical-align:0em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; and &lt;span class=&quot;katex&quot;&gt;&lt;span class=&quot;katex-mathml&quot;&gt;&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mi&gt;W&lt;/mi&gt;&lt;mo mathvariant=&quot;normal&quot;&gt;′&lt;/mo&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;annotation encoding=&quot;application/x-tex&quot;&gt;W^ \prime&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;span class=&quot;katex-html&quot; aria-hidden=&quot;true&quot;&gt;&lt;span class=&quot;base&quot;&gt;&lt;span class=&quot;strut&quot; style=&quot;height:0.751892em;vertical-align:0em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;mord&quot;&gt;&lt;span class=&quot;mord mathdefault&quot; style=&quot;margin-right:0.13889em;&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;msupsub&quot;&gt;&lt;span class=&quot;vlist-t&quot;&gt;&lt;span class=&quot;vlist-r&quot;&gt;&lt;span class=&quot;vlist&quot; style=&quot;height:0.751892em;&quot;&gt;&lt;span style=&quot;top:-3.063em;margin-right:0.05em;&quot;&gt;&lt;span class=&quot;pstrut&quot; style=&quot;height:2.7em;&quot;&gt;&lt;/span&gt;&lt;span class=&quot;sizing reset-size6 size3 mtight&quot;&gt;&lt;span class=&quot;mord mtight&quot;&gt;′&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt; is used to choose the embedding vectors.&lt;/p&gt;
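&lt;p&gt;As a minimal NumPy sketch of the three options above (all shapes, values, and names here are hypothetical): after training, a word&apos;s embedding can be read from a row of W, a column of W′, or their element-wise average:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
V, M = 7, 3  # hypothetical vocabulary size and embedding dimension

W = rng.normal(size=(V, M))        # input-side weights: row i embeds word i
W_prime = rng.normal(size=(M, V))  # output-side weights: column i embeds word i

i = 4                                # some word index
emb_row = W[i]                       # option 1: row of W
emb_col = W_prime[:, i]              # option 2: column of W-prime
emb_avg = (emb_row + emb_col) / 2.0  # option 3: their average
print(emb_avg.shape)  # (3,)
```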
&lt;h2 id=&quot;skip-gram&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#skip-gram&quot; aria-label=&quot;skip gram permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Skip-gram&lt;/h2&gt;
&lt;p&gt;Skip-gram is the reverse of the CBOW process described above: it looks at the center word and predicts which context words will appear around it.&lt;/p&gt;
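&lt;p&gt;To make the contrast concrete, here is a small sketch (the sentence and window size are made up) of how Skip-gram forms its training pairs: each center word is paired with every surrounding word inside the window, rather than averaging the context as CBOW does:&lt;/p&gt;

```python
# Hypothetical tokenized sentence and window size.
tokens = ["the", "fat", "cat", "sat", "on", "the", "mat"]
window = 2

pairs = []
for i, center in enumerate(tokens):
    # Pair the center word with each context word inside the window.
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((center, tokens[j]))

print(pairs[:4])
# [('the', 'fat'), ('the', 'cat'), ('fat', 'the'), ('fat', 'cat')]
```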
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 468px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/1f321f53b53523f30dc7688ca62f0fda/90372/4.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 43.91891891891892%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAJCAYAAAAywQxIAAAACXBIWXMAAA7DAAAOwwHHb6hkAAABOklEQVQoz31Sa2+DMAzk//8+9qGdpk0dLSkhDxIgPHuzwyKxPmbJAg7ncmc7A8XtdkN6prz/Xtc1Ykop5HmOw+GIoihwH9krwoQ9i0q1uEoDYzSlgbUWUkpM04TsP4K+DzidTijLEkIIVHQoDBPKyqKqDZxr0HYd2rald4dlWTaFKepakp03HI/vGIbAmgmr8fH5jfNFoDgLjOMMqT2UcUTkI1lHpN57+jciY8nM3jQNpYUgK1zMB8dpAbeuEIZwR+pmsjXjcqW6SkNTP5Nl7m0IARl7Z5ABJrSuh/MBAxGGMKKn/CokrjVd6PtNofpVSKpYGatkQQ+WOZZlfeinabp4gWtDVMjkUtnYQ3bHyaLmef47lD3RK4yHslneHLGyRBgt74v3RM/WKOE8lErquDbcPybVWtNW9PgBoyW6CDZbbd0AAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;4&quot;
        title=&quot;4&quot;
        src=&quot;/static/1f321f53b53523f30dc7688ca62f0fda/90372/4.png&quot;
        srcset=&quot;/static/1f321f53b53523f30dc7688ca62f0fda/12f09/4.png 148w,
/static/1f321f53b53523f30dc7688ca62f0fda/e4a3f/4.png 295w,
/static/1f321f53b53523f30dc7688ca62f0fda/90372/4.png 468w&quot;
        sizes=&quot;(max-width: 468px) 100vw, 468px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The process follows the figure above; since only a single center word is taken as input, there is no averaging step as in CBOW.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[자연어 처리(NLP)의 전처리 - 워드 임베딩]]></title><description><![CDATA[희소 표현(Sparse Representation…]]></description><link>https://mintyu.github.io/Pytorch05/</link><guid isPermaLink="false">https://mintyu.github.io/Pytorch05/</guid><pubDate>Sun, 02 May 2021 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;희소-표현sparce-representation&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%ED%9D%AC%EC%86%8C-%ED%91%9C%ED%98%84sparce-representation&quot; aria-label=&quot;희소 표현sparce representation permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Sparse Representation (희소 표현)&lt;/h2&gt;
&lt;p&gt;A sparse representation is one in which the data is sparse. Here, &apos;data&apos; can be read loosely as &apos;the meaningful information we actually care about&apos;.&lt;/p&gt;
&lt;p&gt;Let&apos;s walk through a simple example.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;text&quot;&gt;&lt;pre class=&quot;language-text&quot;&gt;&lt;code class=&quot;language-text&quot;&gt;[1, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 1, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 1, 0, 0, 0, 0, 0, 0]

...&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The vectors above are sparse vectors: only the index of the information we care about is set to 1, and every other element is 0. The amount of data each vector actually expresses is therefore sparse, and everything that does not express data is filled with 0.&lt;/p&gt;
&lt;p&gt;Representing data in this way is called a sparse representation.&lt;/p&gt;
&lt;p&gt;The one-hot vectors produced by one-hot encoding, covered in the earlier &lt;a href=&quot;https://mintyu.github.io/Pytorch04/&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;one-hot encoding&lt;/a&gt; post, are also sparse vectors.&lt;/p&gt;
&lt;h2 id=&quot;밀집-표현dense-representation&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EB%B0%80%EC%A7%91-%ED%91%9C%ED%98%84dense-representation&quot; aria-label=&quot;밀집 표현dense representation permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Dense Representation&lt;/h2&gt;
&lt;p&gt;A dense representation is the opposite of a sparse representation. In one-hot encoding, a kind of sparse representation, the vector&apos;s dimensionality equals the size of the vocabulary, and only the index of the target word is 1 while all other values are 0, e.g. &lt;code class=&quot;language-text&quot;&gt;[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]&lt;/code&gt;. A dense representation works differently.&lt;/p&gt;
&lt;p&gt;Suppose the vocabulary contains 10,000 words. One-hot encoding would then produce sparse vectors with 10,000 dimensions.&lt;/p&gt;
&lt;p&gt;With a dense representation, however, if the user sets the dimensionality to 128, each dense vector has exactly 128 dimensions, and every value in it is a real number.&lt;/p&gt;
&lt;p&gt;Representing words as &lt;strong&gt;dense vectors&lt;/strong&gt; in this way is called &lt;strong&gt;word embedding&lt;/strong&gt;.&lt;/p&gt;
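&lt;p&gt;The shape change can be sketched as follows. This is only an illustration with a randomly initialized lookup table; in practice the 128-dimensional values are learned from data (for example with torch.nn.Embedding), and only the sizes below come from the text:&lt;/p&gt;

```python
import random

# Vocabulary of 10,000 words, dense vectors of dimension 128 (sizes from the text).
# A real embedding table is learned from data; here the values are random,
# purely to show the shapes involved.
vocab_size, embed_dim = 10000, 128
random.seed(0)
embedding_table = [[random.gauss(0, 1) for _ in range(embed_dim)]
                   for _ in range(vocab_size)]

word_index = 42                          # hypothetical index from integer encoding
dense_vec = embedding_table[word_index]  # 128 real numbers instead of 10,000 bits
print(len(dense_vec))                    # 128
```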
&lt;h2 id=&quot;워드-임베딩word-embedding&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EC%9B%8C%EB%93%9C-%EC%9E%84%EB%B2%A0%EB%94%A9word-embedding&quot; aria-label=&quot;워드 임베딩word embedding permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Word Embedding&lt;/h2&gt;
&lt;p&gt;Word embedding is a method of representing a word as a &lt;strong&gt;dense vector&lt;/strong&gt;, and the dense vector that comes out of this process is called an &lt;strong&gt;embedding vector&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The table below summarizes the differences between one-hot vectors and embedding vectors.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;One-hot vector&lt;/th&gt;
&lt;th&gt;Embedding vector&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dimensionality&lt;/td&gt;
&lt;td&gt;High-dimensional&lt;/td&gt;
&lt;td&gt;Low-dimensional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Also known as&lt;/td&gt;
&lt;td&gt;A kind of sparse vector&lt;/td&gt;
&lt;td&gt;A kind of dense vector&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;How values are set&lt;/td&gt;
&lt;td&gt;Manually&lt;/td&gt;
&lt;td&gt;Learned from training data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Value type&lt;/td&gt;
&lt;td&gt;0, 1&lt;/td&gt;
&lt;td&gt;Real numbers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;</content:encoded></item><item><title><![CDATA[Preprocessing for NLP - One-Hot Encoding]]></title><description><![CDATA[One-Hot Encoding In the previous post on integer encoding, the Vocabulary…]]></description><link>https://mintyu.github.io/Pytorch04/</link><guid isPermaLink="false">https://mintyu.github.io/Pytorch04/</guid><pubDate>Sun, 25 Apr 2021 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;원-핫-인코딩one-hot-encoding&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EC%9B%90-%ED%95%AB-%EC%9D%B8%EC%BD%94%EB%94%A9one-hot-encoding&quot; aria-label=&quot;원 핫 인코딩one hot encoding permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;One-Hot Encoding&lt;/h2&gt;
&lt;p&gt;In the previous post, &lt;a href=&quot;https://mintyu.github.io/Pytorch03/&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;integer encoding&lt;/a&gt;, we finished assigning an index to each word in the Vocabulary, and I mentioned at the end that the words would then be turned into vectors via one-hot encoding and word embedding. This post covers the first of those: one-hot encoding.&lt;/p&gt;
&lt;h3 id=&quot;원-핫-인코딩이란&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EC%9B%90-%ED%95%AB-%EC%9D%B8%EC%BD%94%EB%94%A9%EC%9D%B4%EB%9E%80&quot; aria-label=&quot;원 핫 인코딩이란 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;What Is One-Hot Encoding?&lt;/h3&gt;
&lt;p&gt;One-hot encoding is a way of representing a word as a vector whose dimensionality equals the size of the vocabulary: the index of the word being represented is set to 1, and every other index is set to 0. A vector expressed this way is called a one-hot vector. The word indices here come from the mapping produced earlier by integer encoding.&lt;/p&gt;
&lt;p&gt;As an example, let&apos;s build one-hot vectors from the Korean sentence &lt;code class=&quot;language-text&quot;&gt;&amp;quot;자연어 전처리 과정입니다.&amp;quot;&lt;/code&gt; (&amp;quot;This is an NLP preprocessing step.&amp;quot;).&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;from&lt;/span&gt; konlpy&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;tag &lt;span class=&quot;token keyword&quot;&gt;import&lt;/span&gt; Okt  
okt &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; Okt&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
token &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; okt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;morphs&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;자연어 전처리 과정입니다.&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;  
&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;token&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This tokenizes the sentence into morphemes with the &lt;code class=&quot;language-text&quot;&gt;Okt&lt;/code&gt; morphological analyzer from the &lt;code class=&quot;language-text&quot;&gt;konlpy&lt;/code&gt; package.&lt;/p&gt;
&lt;p&gt;If the &lt;code class=&quot;language-text&quot;&gt;konlpy&lt;/code&gt; package is not installed, install it with &lt;code class=&quot;language-text&quot;&gt;$pip install konlpy&lt;/code&gt;. If a No JVM error occurs while running the code, install a JVM as well; since I work inside a Docker container, I installed one with &lt;code class=&quot;language-text&quot;&gt;$apt install default-jdk&lt;/code&gt;. (Use whatever installation method suits your operating system.)&lt;/p&gt;
&lt;p&gt;The output is as follows.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/be6ff415d1ee07fe3c57a37e90718c5f/7131f/1.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 10.81081081081081%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAACCAYAAABYBvyLAAAACXBIWXMAABYlAAAWJQFJUiTwAAAAZklEQVQI11WNSQoAIQwE5yUqrqjgghv+/2E9JIeBuaaqOo9zDnNO3HuRc4YQAmMMWGtB7JyDEAKUUlhroffOTmuNu5QSM3LJeygkae/9wVIKvPfQWvOdRCnlb7DWys9ijNwYY7h5AcKLO5tNAAy9AAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;1&quot;
        title=&quot;1&quot;
        src=&quot;/static/be6ff415d1ee07fe3c57a37e90718c5f/fcda8/1.png&quot;
        srcset=&quot;/static/be6ff415d1ee07fe3c57a37e90718c5f/12f09/1.png 148w,
/static/be6ff415d1ee07fe3c57a37e90718c5f/e4a3f/1.png 295w,
/static/be6ff415d1ee07fe3c57a37e90718c5f/fcda8/1.png 590w,
/static/be6ff415d1ee07fe3c57a37e90718c5f/7131f/1.png 710w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Next, we assign an index to each token, as shown below.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;word2index &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; voca &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; token&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
     &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; voca &lt;span class=&quot;token keyword&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; word2index&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;keys&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
       word2index&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;voca&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;word2index&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;word2index&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The result is as follows.&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/f1f833b1a3dffb048e41ba4c2f11f3a2/b12f7/2.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 8.108108108108107%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAACCAYAAABYBvyLAAAACXBIWXMAABYlAAAWJQFJUiTwAAAAXElEQVQI102NWwoAIQwDPUrFJyooVrz/zbKkIOxXmjBDXc4ZrTUw11qIMcJ7jzkn9t4IISClZMy9F6UU68xzDsYY5rDTcxR67ybWWg0mwPvthP8P2ZmqatzjRQQfao86k7GQkAAAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;2&quot;
        title=&quot;2&quot;
        src=&quot;/static/f1f833b1a3dffb048e41ba4c2f11f3a2/fcda8/2.png&quot;
        srcset=&quot;/static/f1f833b1a3dffb048e41ba4c2f11f3a2/12f09/2.png 148w,
/static/f1f833b1a3dffb048e41ba4c2f11f3a2/e4a3f/2.png 295w,
/static/f1f833b1a3dffb048e41ba4c2f11f3a2/fcda8/2.png 590w,
/static/f1f833b1a3dffb048e41ba4c2f11f3a2/efc66/2.png 885w,
/static/f1f833b1a3dffb048e41ba4c2f11f3a2/b12f7/2.png 1020w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;one_hot_encoding&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;word&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; word2index&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    one_hot_vector &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;word2index&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
    index &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; word2index&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;word&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
    one_hot_vector&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;index&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; one_hot_vector&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Above is the &lt;code class=&quot;language-text&quot;&gt;one_hot_encoding&lt;/code&gt; function, which takes a token and returns the corresponding one-hot vector.&lt;/p&gt;
&lt;p&gt;Calling the function with a token, as below, yields its one-hot vector.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;one_hot_encoding&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;자연어&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;word2index&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 328px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/b213d23037aa6be243a8dc7872ea1b1c/d5c60/3.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 29.72972972972973%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAGCAYAAADDl76dAAAACXBIWXMAABYlAAAWJQFJUiTwAAAAz0lEQVQY022RuwqFMBBE/QcRfL8q7axEEbUQVBSxVOz9/0+Yy6xEJNxi5OxkslkTIwxDBEEgIn/19f6t/8sb/MRx/DalPM+D7/sSIlNk13XFVxnl06OiKIJBaNsW8zwjTVMxr+tCURSwLAvHcaCua5imifM80TSN8LZtGIZBuO973PeNPM+fhlVVYZomCSdJgnVdMY6jTESmyMuySI6Zruuw77sMUZalHJZl2fPLamxOx9q2bfFYk+np7DjOu4eHseYe43vh+iN8fdVIf0CdfzUwrKjWP/eRAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;3&quot;
        title=&quot;3&quot;
        src=&quot;/static/b213d23037aa6be243a8dc7872ea1b1c/d5c60/3.png&quot;
        srcset=&quot;/static/b213d23037aa6be243a8dc7872ea1b1c/12f09/3.png 148w,
/static/b213d23037aa6be243a8dc7872ea1b1c/e4a3f/3.png 295w,
/static/b213d23037aa6be243a8dc7872ea1b1c/d5c60/3.png 328w&quot;
        sizes=&quot;(max-width: 328px) 100vw, 328px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Passing the token &lt;code class=&quot;language-text&quot;&gt;&amp;quot;자연어&amp;quot;&lt;/code&gt; into the function returned the one-hot vector &lt;code class=&quot;language-text&quot;&gt;[1, 0, 0, 0, 0, 0]&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Because the token &lt;code class=&quot;language-text&quot;&gt;&amp;quot;자연어&amp;quot;&lt;/code&gt; was mapped to index 0 during integer encoding, the resulting one-hot vector has a 1 only at index 0.&lt;/p&gt;
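&lt;p&gt;Putting the pieces together, here is a self-contained sketch of the same pipeline. It uses naive whitespace tokenization instead of konlpy&apos;s Okt, so the token boundaries differ from the post, but the indexing and encoding steps are the same:&lt;/p&gt;

```python
# Same pipeline as above, but with naive whitespace tokenization so it runs
# without konlpy/Okt installed. Token boundaries therefore differ from the post.
token = "자연어 전처리 과정 입니다 .".split()

# assign each new token the next available index
word2index = {}
for voca in token:
    if voca not in word2index:
        word2index[voca] = len(word2index)

def one_hot_encoding(word, word2index):
    one_hot_vector = [0] * len(word2index)
    one_hot_vector[word2index[word]] = 1
    return one_hot_vector

print(one_hot_encoding("자연어", word2index))  # [1, 0, 0, 0, 0]
```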
&lt;h3 id=&quot;원-핫-인코딩의-한계&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EC%9B%90-%ED%95%AB-%EC%9D%B8%EC%BD%94%EB%94%A9%EC%9D%98-%ED%95%9C%EA%B3%84&quot; aria-label=&quot;원 핫 인코딩의 한계 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Limitations of One-Hot Encoding&lt;/h3&gt;
&lt;p&gt;One drawback of one-hot encoding is that the vectors grow as the number of words grows. Since the size of the vocabulary (Vocabulary) equals the dimensionality of each one-hot vector, a large vocabulary makes the vectors expensive to store and awkward to compute with.&lt;/p&gt;
&lt;p&gt;One-hot vectors also cannot express similarity between words. Given the vocabulary &lt;code class=&quot;language-text&quot;&gt;[&amp;#39;오렌지&amp;#39;, &amp;#39;사과&amp;#39;, &amp;#39;개&amp;#39;, &amp;#39;고양이&amp;#39;]&lt;/code&gt; (orange, apple, dog, cat), one-hot encoding yields the four vectors &lt;code class=&quot;language-text&quot;&gt;[1, 0, 0, 0]&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;[0, 1, 0, 0]&lt;/code&gt;, &lt;code class=&quot;language-text&quot;&gt;[0, 0, 1, 0]&lt;/code&gt;, and &lt;code class=&quot;language-text&quot;&gt;[0, 0, 0, 1]&lt;/code&gt;. A person looking at the word list can group oranges and apples as fruit and dogs and cats as animals, but the one-hot vectors carry none of that information; even a person shown only the vectors, without the vocabulary, could not tell which words are similar.&lt;/p&gt;
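&lt;p&gt;This can be checked directly: any two distinct one-hot vectors are orthogonal, so a similarity measure such as cosine similarity is 0 for every pair of different words. A small sketch, using vectors like those in the example above:&lt;/p&gt;

```python
import math

# Any two distinct one-hot vectors are orthogonal, so their cosine similarity
# is always 0: the representation carries no notion of "orange is like apple".
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

orange = [1, 0, 0, 0]
apple  = [0, 1, 0, 0]
dog    = [0, 0, 1, 0]

print(cosine(orange, apple))  # 0.0
print(cosine(orange, dog))    # 0.0
```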
&lt;p&gt;For this reason, word embeddings, which can also capture similarity between words, are widely used instead.&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Preprocessing for NLP - Integer Encoding and Padding]]></title><description><![CDATA[Integer Encoding takes an iterable (list, tuple, string, and so on) as input and returns sequential indices. We take the vocabulary built in the earlier Vocabulary post and convert it to integers. Index…]]></description><link>https://mintyu.github.io/Pytorch03/</link><guid isPermaLink="false">https://mintyu.github.io/Pytorch03/</guid><pubDate>Thu, 15 Apr 2021 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;정수-인코딩&quot; style=&quot;position:relative;&quot;&gt;&lt;a href=&quot;#%EC%A0%95%EC%88%98-%EC%9D%B8%EC%BD%94%EB%94%A9&quot; aria-label=&quot;정수 인코딩 permalink&quot; class=&quot;anchor before&quot;&gt;&lt;svg aria-hidden=&quot;true&quot; focusable=&quot;false&quot; height=&quot;16&quot; version=&quot;1.1&quot; viewBox=&quot;0 0 16 16&quot; width=&quot;16&quot;&gt;&lt;path fill-rule=&quot;evenodd&quot; d=&quot;M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z&quot;&gt;&lt;/path&gt;&lt;/svg&gt;&lt;/a&gt;Integer Encoding&lt;/h2&gt;
&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;enumerate()&lt;/code&gt; takes an iterable (a list, tuple, string, and so on; note that a plain &lt;code class=&quot;language-text&quot;&gt;set&lt;/code&gt; has no guaranteed order) and yields each item together with a sequentially increasing index. We take the vocabulary built in the earlier &lt;a href=&quot;https://mintyu.github.io/Pytorch02/&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;Vocabulary&lt;/a&gt; post and convert its words to integers.&lt;/p&gt;
&lt;p&gt;Indices 0 and 1 are reserved for other purposes, so the remaining words are numbered sequentially starting from 2.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;word_to_index &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;word&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; index &lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; index&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; word &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;enumerate&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;vocab&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;
word_to_index&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;pad&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1&lt;/span&gt;
word_to_index&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;unk&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
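&lt;p&gt;As a side note, the same numbering can be produced with the optional start argument of enumerate. A small demo with a made-up (word, count) vocabulary standing in for the real one:&lt;/p&gt;

```python
# Hypothetical (word, count) pairs standing in for the real vocabulary.
vocab = [("자연어", 5), ("전처리", 3), ("과정", 2)]

# enumerate(vocab, start=2) begins numbering at 2, so an explicit
# "index + 2" offset is not needed.
word_to_index = {word: index for index, (word, _) in enumerate(vocab, start=2)}
word_to_index['pad'] = 1
word_to_index['unk'] = 0
print(word_to_index)  # {'자연어': 2, '전처리': 3, '과정': 4, 'pad': 1, 'unk': 0}
```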
&lt;br/&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;encoded &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; line &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; tokenized&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;# read the input data one sentence at a time&lt;/span&gt;
    temp &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; w &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; line&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;# read each line one token at a time&lt;/span&gt;
    	&lt;span class=&quot;token keyword&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
        	temp&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;append&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;word_to_index&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;w&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;# convert the token to its integer index&lt;/span&gt;
      	&lt;span class=&quot;token keyword&quot;&gt;except&lt;/span&gt; KeyError&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;# words not in the vocabulary are replaced with unk&lt;/span&gt;
        	temp&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;append&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;word_to_index&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;unk&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;# substitute the index of unk&lt;/span&gt;

    encoded&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;append&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;temp&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;encoded&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;# print only the first 10 entries&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The output looks like this:&lt;/p&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/a991c847532d2f7521398bbe12177eda/fd84e/1.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 36.486486486486484%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAHCAYAAAAIy204AAAACXBIWXMAABYlAAAWJQFJUiTwAAAA8klEQVQoz02SV66GUAiEXYk+aIy9917i/pfEzUfCn/twAgwDDKgTx7H0fS9t20pd17+X57kURSFVVak1H7wsS8myTDFi+Pie54kDcByH7Psu67rKtm0yjqPM8yzTNMl5nmqXZVFsGAb5vu/HJQdOHASBOEmSSNd1mkAphTSkEB/lxOTA4FojG0QOQVEUicMaz/PI+74KMg3FFF3XpZg9MHhwjIsymqJaFbK/yTbfVKGGgqZpVJ3l7CzgYPC4q+u64hDQDBKWm9lUU3vft/psAs4zzLagLgxDcWx/yEyxr8tt+QM4SZqmioPhG/7/i2N935c/9pLjJq01llMAAAAASUVORK5CYII=&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;1&quot;
        title=&quot;1&quot;
        src=&quot;/static/a991c847532d2f7521398bbe12177eda/fcda8/1.png&quot;
        srcset=&quot;/static/a991c847532d2f7521398bbe12177eda/12f09/1.png 148w,
/static/a991c847532d2f7521398bbe12177eda/e4a3f/1.png 295w,
/static/a991c847532d2f7521398bbe12177eda/fcda8/1.png 590w,
/static/a991c847532d2f7521398bbe12177eda/efc66/1.png 885w,
/static/a991c847532d2f7521398bbe12177eda/fd84e/1.png 1056w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;As this output shows, the sentences all have different lengths, so the encoded arrays differ in length from sentence to sentence.&lt;/p&gt;
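&lt;p&gt;Before such sequences are fed to a model they are usually padded to a common length. A minimal sketch using the &apos;pad&apos; index (1) reserved earlier; the encoded sample data here is made up for illustration:&lt;/p&gt;

```python
# Pad variable-length encoded sentences to a common length with the
# 'pad' index (1) reserved earlier. Sample data is made up for illustration.
encoded = [[2, 5, 3], [4, 2], [6, 3, 2, 7, 5]]
max_len = max(len(l) for l in encoded)

padded = [line + [1] * (max_len - len(line)) for line in encoded]
for row in padded:
    print(row)
# [2, 5, 3, 1, 1]
# [4, 2, 1, 1, 1]
# [6, 3, 2, 7, 5]
```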
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;max_len &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;l&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; l &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; encoded&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;리뷰의 최대 길이 : %d&apos;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;%&lt;/span&gt; max_len&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;리뷰의 최소 길이 : %d&apos;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;l&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; l &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; encoded&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;리뷰의 평균 길이 : %f&apos;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; encoded&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;encoded&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;hist&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;s&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; s &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; encoded&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; bins&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;xlabel&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;length of sample&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;ylabel&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;number of sample&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;show&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/4a4e80b5e01e2acf71979d456b5fdaf4/5a6dd/2.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 82.43243243243244%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAQCAYAAAAWGF8bAAAACXBIWXMAABYlAAAWJQFJUiTwAAAC3ElEQVQ4y5VTTUgUYRieW3TsomCiZgiJoIbkz4biioI3BQ8e3H4OJnX00KWgLovuJtvKshSrQtQl7IcCKdISIqU0KNMUdX9GXXe33Z2dmZ39b3dmnma+VctWC1945vte5p3ne97nm5eqr69HW1sb6urq0NzcDK1Wi8rKStTU1KCqqgoFBQUoLCxEUVERiouL/wm1htJoNGhtbUVtbS0aGhrQ2NSEiooKlJeXo7S0FCUlJSgrKyOkeXl5yM/PPxDqO/Vwqr29HX19fejp6YFOp8PlixfQ0dFB0N3djd7eXvT396OrqwtNymEtLS2ki7+hiqqurgZltVqhhj8QQJjn4PAGEWQ5RAQBPM+D4zgwDEP2kUjkQKjv1NDr9aDGxsZIwiofqvHOxYNPiThKyLJM1sHBQVCjo6MkYUIhsk7aGQSjqd1KSArk/yCTyZDygYEBUGazGZIkIcRmFU47WbDxNNkn0uI+BYep2yUkCtVHOp3eI3z87Qe4RBqftnjMbma9kY5CaDKZsi3vEF5/vY5Xq0E8+urFhLKqIUryoSpzCA0GAzKiSG5KVlrX3JvH1RcrGJnfVvzM+pqRcn3bsTiX8M7QEEm2/SFMrQdxxjQL3fgShmc38XIlsE/N7gVJe4RyLqHNZiMJ7Q0ohAzOWj7i0pPvuKaovPJ8GT8zEtzhZI6PWZVEIkTxT8KREZJ4/Azuzmzg9NAMThk/4Pz9eZyzzuHWlBOG9xukZotPwiOk4GIT+8hFUcxVyLEsOh98xrEbb3D85iRO6qdx4vZbaG1z6Hz4BRtsHM+WfFj0CUonihWyiKeLXowveDCx7IMjLMJkNPyeFHXEaD8LRyCMBYcbdCiKBZcHdk8ATl8IdnoTbCSGLfc22LCAZDKBFdoNJpbCkp2G2vTwsBmUxWLZI0zEokjFY4hHBKQSMWUfBccE4Vhfg9NhB+1yYm1tlcy5OsNRIYykUs+z2b/BaDTiFxDzD8+oIsrvAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;2&quot;
        title=&quot;2&quot;
        src=&quot;/static/4a4e80b5e01e2acf71979d456b5fdaf4/fcda8/2.png&quot;
        srcset=&quot;/static/4a4e80b5e01e2acf71979d456b5fdaf4/12f09/2.png 148w,
/static/4a4e80b5e01e2acf71979d456b5fdaf4/e4a3f/2.png 295w,
/static/4a4e80b5e01e2acf71979d456b5fdaf4/fcda8/2.png 590w,
/static/4a4e80b5e01e2acf71979d456b5fdaf4/5a6dd/2.png 802w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The maximum sentence length is 63, the minimum is 1, and the average is 15.62, so the lengths are far from uniform. To process these together, every sentence must be brought up to the length of the longest one, 63. This is where the &lt;code class=&quot;language-text&quot;&gt;pad&lt;/code&gt; token we saw earlier comes in: during integer encoding we assigned index 0 to &lt;code class=&quot;language-text&quot;&gt;unk&lt;/code&gt; and index 1 to &lt;code class=&quot;language-text&quot;&gt;pad&lt;/code&gt;, so we can fill the leftover positions of every shorter sentence with the pad token.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; line &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; encoded&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;line&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt; max_len&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;# if the current sample is shorter than the chosen length&lt;/span&gt;
        line &lt;span class=&quot;token operator&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;word_to_index&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;pad&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;max_len &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;line&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;# fill the rest with &apos;pad&apos; tokens&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
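&lt;p&gt;The loop above works because &lt;code class=&quot;language-text&quot;&gt;+=&lt;/code&gt; on a Python list mutates it in place, so &lt;code class=&quot;language-text&quot;&gt;encoded&lt;/code&gt; itself is updated. Here is a self-contained sketch of the same idea with a toy list of sequences (the pad index 1 matches the index we assigned earlier; the sequences themselves are made up):&lt;/p&gt;

```python
# Toy example: pad variable-length integer sequences to a common length.
pad_id = 1  # index assigned to 'pad' during integer encoding
encoded = [[7, 3, 9], [4, 2], [5]]
max_len = max(len(line) for line in encoded)

for line in encoded:
    # list += mutates in place, so `encoded` itself is updated;
    # multiplying by 0 makes this a no-op for full-length sequences
    line += [pad_id] * (max_len - len(line))

print(encoded)  # [[7, 3, 9], [4, 2, 1], [5, 1, 1]]
```

&lt;p&gt;The same result can also be obtained with a library helper such as Keras&apos; &lt;code class=&quot;language-text&quot;&gt;pad_sequences&lt;/code&gt;, which pads and truncates in one call.&lt;/p&gt;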
&lt;p&gt;Let&apos;s print the statistics again.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;max_len &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;l&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; l &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; encoded&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;리뷰의 최대 길이 : %d&apos;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;%&lt;/span&gt; max_len&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;리뷰의 최소 길이 : %d&apos;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;token builtin&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;l&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; l &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; encoded&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;리뷰의 평균 길이 : %f&apos;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; encoded&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;encoded&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;hist&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token builtin&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;s&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; s &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; encoded&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; bins&lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;xlabel&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;length of sample&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;ylabel&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&apos;number of sample&apos;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;
plt&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;show&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/20ce07b7d8ca31f8fb6257449a1b494a/a4262/3.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 81.75675675675677%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAQCAYAAAAWGF8bAAAACXBIWXMAABYlAAAWJQFJUiTwAAACB0lEQVQ4y61TuY4aQRCdzzB3QoAwEAHiEBIECAERCQnfQQxCIuQQAQniSPkIr+UMpDWhvbLXgBF7cM0MNgtGMDxPtRkMi01ESU9d9ab6VVV3D+dwOOD3+0Gr1+uFz+eD2WyG3W6H1WqFSqWCTqeDXq+HwWC4CMrhSCQUCsHtdjPBQCDABJ1OJ2w2G0s0Go3QarV4I4ur1Wq5iJqtx6DCGo0GnMfjQSwWQzQaRTgcRiQSgcvlYl0Hg0HE43EkEgl4PG5Y3pphsVhlWM5A05hMJnDJZBKz2Qz9fh88zzMIgsAgiiKLRYHH43iG789T/JiLEGRePALtlyQJlUoFXKFQABkJXLKv/C986IsXcxqNBrhisciC4XDIqux2uxNsZY7s8+gn3t1Pmb99lbfZbBhfq9XAZbNZFtBo0n4zJSkm7f278QI397M/nLQ75LCi2+1fwXQ6jfV6jfF4fB3BTCbDAuVgr9IhkZPJ5LodXk0wn8+zYDQanQgqoM2vBbfS7v+3rLxDEvyXKb3eTV5w840/4Q5T7Bup1+vgcrkcC6bTKZbLJVar1WElvMg+pA0+PQp4/+VZVtswTvlOWCwWTKNarYIrlUosoA/Ke6QbV3w6Wxrp4ekJg+EDtrKv/J7HOWTNZhNcKpXCfD5Hr9fDYDA4A/GtVgudTgcfb2/RbrfR7XbPcki4XC7jN35YLfvox+sfAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;3&quot;
        title=&quot;3&quot;
        src=&quot;/static/20ce07b7d8ca31f8fb6257449a1b494a/fcda8/3.png&quot;
        srcset=&quot;/static/20ce07b7d8ca31f8fb6257449a1b494a/12f09/3.png 148w,
/static/20ce07b7d8ca31f8fb6257449a1b494a/e4a3f/3.png 295w,
/static/20ce07b7d8ca31f8fb6257449a1b494a/fcda8/3.png 590w,
/static/20ce07b7d8ca31f8fb6257449a1b494a/a4262/3.png 814w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Thanks to the padding, every sequence now has the same length, 63!&lt;/p&gt;
&lt;p&gt;The preprocessed result looks like this.&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;python&quot;&gt;&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;encoded&lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;span
      class=&quot;gatsby-resp-image-wrapper&quot;
      style=&quot;position: relative; display: block; margin-left: auto; margin-right: auto;  max-width: 590px;&quot;
    &gt;
      &lt;a
    class=&quot;gatsby-resp-image-link&quot;
    href=&quot;/static/ea34c6d579f36e3d0ca5fa8934a5a5d3/5819f/4.png&quot;
    style=&quot;display: block&quot;
    target=&quot;_blank&quot;
    rel=&quot;noopener&quot;
  &gt;
    &lt;span
    class=&quot;gatsby-resp-image-background-image&quot;
    style=&quot;padding-bottom: 36.486486486486484%; position: relative; bottom: 0; left: 0; background-image: url(&apos;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAHCAYAAAAIy204AAAACXBIWXMAABYlAAAWJQFJUiTwAAAA1UlEQVQoz22RVw7EMAhEfZZISZRenF5/cv8rsXpIWNHufuAZbBjAuDzPpe978d5L13XKwbqupaoqxbcVRSFZlv0YOlEUiSPARN5oRb7v4G3bBjTOW1mW4qzDpmlCFxYwTZPyYRhUfBxHvYNv2xZ8i9EOqXCepxzHoUHYsizqP88j+77r3bqumnxdl/rzPCve962cd5pyHDhUA0mgA4QQtVERozjJdIMPYuSCQRABkAQb35Zk45MAt3/Ft2LG2YcuhUATBO3Dqcym4ziWJEn+WpqmAYn7ANfK3HVZ/AXiAAAAAElFTkSuQmCC&apos;); background-size: cover; display: block;&quot;
  &gt;&lt;/span&gt;
  &lt;img
        class=&quot;gatsby-resp-image-image&quot;
        alt=&quot;4&quot;
        title=&quot;4&quot;
        src=&quot;/static/ea34c6d579f36e3d0ca5fa8934a5a5d3/fcda8/4.png&quot;
        srcset=&quot;/static/ea34c6d579f36e3d0ca5fa8934a5a5d3/12f09/4.png 148w,
/static/ea34c6d579f36e3d0ca5fa8934a5a5d3/e4a3f/4.png 295w,
/static/ea34c6d579f36e3d0ca5fa8934a5a5d3/fcda8/4.png 590w,
/static/ea34c6d579f36e3d0ca5fa8934a5a5d3/efc66/4.png 885w,
/static/ea34c6d579f36e3d0ca5fa8934a5a5d3/5819f/4.png 1042w&quot;
        sizes=&quot;(max-width: 590px) 100vw, 590px&quot;
        style=&quot;width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;&quot;
        loading=&quot;lazy&quot;
      /&gt;
  &lt;/a&gt;
    &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;We can confirm with our own eyes that every sequence has been padded to length 63 and that the empty slots are all filled with 1, the index of the pad token.&lt;/p&gt;
&lt;p&gt;Now that each word is mapped to a unique integer, the next step is to map each integer to a unique word vector. This step is called &lt;strong&gt;embedding&lt;/strong&gt;. There are two main ways to obtain word vectors, one-hot encoding and word embeddings, and word embeddings are what is used in most cases.&lt;/p&gt;
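&lt;p&gt;Conceptually, an embedding layer is just a lookup table: a trainable matrix of shape (vocab_size, embedding_dim) whose row i holds the vector for word i. A minimal NumPy sketch of that lookup (the sizes and the random initialization are made-up illustration values, not the actual model):&lt;/p&gt;

```python
import numpy as np

vocab_size = 100   # hypothetical vocabulary size
embedding_dim = 8  # hypothetical vector dimension

# Trainable lookup table: row i is the dense vector for word index i.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# A padded sequence of word indices (pad index 1 at the end) becomes
# a (sequence_length, embedding_dim) array via integer indexing.
sequence = np.array([7, 3, 9, 1, 1])
vectors = embedding_matrix[sequence]
print(vectors.shape)  # (5, 8)
```

&lt;p&gt;In a real model this matrix is a layer whose rows are updated during training, so words that behave similarly end up with similar vectors.&lt;/p&gt;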
&lt;p&gt;In the next post, we will cover what one-hot encoding is and which of its problems lead us to use Word2Vec instead.&lt;/p&gt;