Portfolio


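Crawling

Crawling is the whole process of having a program visit web pages, find and collect information, and process it into the desired form.

Crawling is generally considered legal as long as it does not go against the wishes of the site operator, and illegal otherwise.

The robots.txt file in the site's root directory states whether crawling is prohibited. (Paths listed under Disallow must not be crawled.)

Web programming elements in a page's source can be recognized as copyrighted works, so copying them without permission is copyright infringement.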
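The robots.txt check described above can be automated with Python's standard library. The sketch below parses a made-up robots.txt (the rules and the example.com URLs are assumptions for illustration, not taken from any real site) and asks whether a given path may be fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# can_fetch(user_agent, url) returns True when the rules allow the URL.
allowed_public = parser.can_fetch("*", "https://example.com/stock.html")
allowed_private = parser.can_fetch("*", "https://example.com/private/data.html")
print(allowed_public, allowed_private)
```

For a live site, the same class can download the file itself: call parser.set_url() with the address of the site's robots.txt and then parser.read() before calling can_fetch().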

Required packages

(required) pip3 install beautifulsoup4 (or pip3 install bs4)

(required) pip3 install requests

(required) pip3 install pandas

(required) pip3 install plotly

(optional) pip3 install lxml

Installing the libraries (assuming most are already installed)

입력


!pip3 install requests


!pip3 install beautifulsoup4


# macOS, Linux
!ls
# Windows
!dir

출력


C 드라이브의 볼륨에는 이름이 없습니다.
 볼륨 일련 번호: CC5E-6766

 C:\Users\leehojun\Google 드라이브\11_1. 콘텐츠 동영상 결과물\007. 크롤링 강의 디렉터리

2020-04-03  13:16    <DIR>          .
2020-04-03  13:16    <DIR>          ..
2020-04-03  13:10    <DIR>          .ipynb_checkpoints
2020-04-03  12:18           391,939 001.ipynb
2020-04-03  01:58    <DIR>          참고자료
2020-04-03  13:16               999 최종강의자료_크롤링.ipynb
               2개 파일             392,938 바이트
               4개 디렉터리  13,425,782,784 바이트 남음

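URL (Uniform Resource Locator)

A URL is a convention for indicating where a resource is located.

It is commonly thought of as a website address, but a URL can identify any resource on a computer network, not only websites.

To access the address, you need to know the protocol that the URL specifies and connect using that same protocol. (For FTP you use an FTP client, for HTTP a web browser, and for Telnet a Telnet program.)

Source: Wiki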
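To make the parts of a URL concrete, here is a small sketch that splits one with the standard library. The query string appended to the practice-page address is an assumption added for illustration.

```python
from urllib.parse import urlparse, parse_qs

# The practice page address with an illustrative (assumed) query string.
url = "http://www.paullab.co.kr/stock.html?pa1=val1&pa2=value2"

parts = urlparse(url)
print(parts.scheme)  # the protocol to connect with
print(parts.netloc)  # the host that serves the resource
print(parts.path)    # where the resource lives on that host

# parse_qs turns the query string into a dict of lists.
query = parse_qs(parts.query)
print(query)
```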

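HTTP (Hypertext Transfer Protocol)

HTTP can transfer HTML, XML, JavaScript, audio, video, images, PDF, and so on.

An HTTP message consists of a request line or status line / headers (optional) / an empty line (end of the headers) / a body (optional).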

HTTP request ex 1


GET /stock.html HTTP/1.1
Host: www.paullab.co.kr

HTTP request ex 2


GET /index.html HTTP/1.1
user-agent: MSIE 6.0; Windows NT 5.0
accept: text/html, */*
cookie: name=value
referer: http://www.naver.com
host: www.paullab.co.kr

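Request line: the method (how the data is handled), the requested page, and the protocol version.

User-Agent: the type and version of the user's web browser.

Accept: the data types, among those the web server can send, that the web browser is able to handle.

Here text/html means that the browser can handle HTML documents, and */* means that it can handle any type of document. (These values are called MIME types.)

Cookie: HTTP itself is stateless (it does not keep the connection state between requests), so cookies are values created to remember user information such as login authentication. In other words, they are used to determine whether the user holds valid login credentials.

Referer: the domain or URL of the page that was visited just before the current page.

Host: the domain that the user requested.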
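The message layout (request line, headers, empty line, body) can be checked by splitting a raw request by hand. The following is a minimal sketch, not a full HTTP parser; the parse_request helper is a hypothetical name introduced here, and the raw text mirrors the second request example.

```python
# Raw request text modelled on the second request example; lines end with CRLF.
RAW_REQUEST = (
    "GET /index.html HTTP/1.1\r\n"
    "user-agent: MSIE 6.0; Windows NT 5.0\r\n"
    "accept: text/html, */*\r\n"
    "cookie: name=value\r\n"
    "referer: http://www.naver.com\r\n"
    "host: www.paullab.co.kr\r\n"
    "\r\n"
)


def parse_request(raw: str):
    """Split a raw HTTP request into its request line, headers, and body."""
    # The first empty line separates the headers from the body.
    head, _, body = raw.partition("\r\n\r\n")
    request_line, *header_lines = head.split("\r\n")
    method, path, version = request_line.split(" ")
    headers = {}
    for line in header_lines:
        # Split on the first colon only, because values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, path, version, headers, body


method, path, version, headers, body = parse_request(RAW_REQUEST)
print(method, path, version)
print(headers["host"])
```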

HTTP Response ex 1


HTTP/1.1 200 OK                                    ## status line
Content-Type: application/xhtml+xml; charset=utf-8 ## header
                                                   ## empty line
<html>                                             ## body
...
</html>

HTTP Response ex 2


HTTP/1.1 200 OK
Server: NCSA/1.4.2
Content-type: text/html
Content-length: 107

<html>
...
</html>

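Status line: contains the web protocol version and the response code.

Server: contains the type and version of the web application.

Content-type: contains the MIME type.

Content-length: contains the size of the response body.

Body: contains the web page that the user requested.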

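HTTP methods

GET: retrieves a resource. (Parameters are appended to the URL after ?; suited to small values.)

POST: creates a resource. (Data is sent in the body; suited to relatively large payloads.)

PUT: requests modification of a resource.

DELETE: requests deletion of a resource.

HEAD: requests only the HTTP headers; used to check whether a resource exists.

OPTIONS: asks the web server which methods it supports.

TRACE: checks the path along which the request is received.

CONNECT: starts two-way communication with the requested resource.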

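Status codes

200: the server handled the request successfully.

201: the request succeeded and the server created a new resource.

202: the server accepted the request but has not processed it yet.

301: the requested page has moved permanently to a new location.

403: the server refused the request.

404: the server could not find the requested page.

500: a server error occurred and the request could not be fulfilled.

503: the server is currently unavailable because it is overloaded or down for maintenance.

Source: Wiki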
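The codes listed above fall into classes by their first digit. As a rough sketch, the standard library's http.HTTPStatus can supply the official reason phrase, while the describe helper below (a hypothetical name) adds the class.

```python
from http import HTTPStatus


def describe(code: int) -> str:
    """Return the code, its standard reason phrase, and its class."""
    status = HTTPStatus(code)
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirection"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "informational"
    return f"{code} {status.phrase} ({category})"


for code in (200, 201, 202, 301, 403, 404, 500, 503):
    print(describe(code))
```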

입력


import requests
import bs4


requests.__version__ # check the requests version

출력


'2.22.0'

입력


bs4.__version__ # check the bs4 version

출력


'4.7.1'

입력


from datetime import datetime
 
datetime.now() # current date and time

출력


datetime.datetime(2020, 4, 3, 13, 36, 59, 789815)

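requests

A library for sending HTTP requests. The Response object it returns exposes the following attributes:

.text: the response body as a str

.headers: the response headers (stored as key/value pairs)

.encoding: the encoding used to decode .text

.status_code: the HTTP status code, which tells whether the request succeeded, failed, or ended in some other state

.ok: whether the data was retrieved without an error (True when the status code is below 400)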

입력


import requests

html = requests.get('http://www.paullab.co.kr/stock.html')
html

출력


<Response [200]> # the server handled the request successfully

입력


html.text # the Korean text comes out garbled

출력


'<!DOCTYPE html>\n<html lang="en">\n\n<head>\n  
<meta charset="UTF-8">\n  
<meta name="viewport" content="width=device-width, initial-scale=1.0">\n 
<meta http-equiv="X-UA-Compatible" content="ie=edge">\n  
<title>Document</title>\n  <link rel="stylesheet" 
href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css">\n  
<style>\n    h1{\n      margin: 2rem;\n    }\n    
h1>span{\n      font-size: 1rem;\n    }\n    
.main {\n      width: 70%;\n      margin: 3rem auto auto auto;\n      
text-align: center\n    }\n\n    table {\n      width: 100%;\n    }\n  
</style>\n</head>\n\n<body>\n  
<h1>í\x81¬ë¡¤ë§\x81 ì\x97°ì\x8aµì\x9a© í\x8e\x98ì\x9d´ì§\x80

							 ...

<td class="num"><span>139,085</span></td>\n        
</tr>\n      
</tbody>\n    
</table>\n  </div>\n</body>\n\n</html>\n'

입력


html.headers

출력


{'Server': 'nginx', 'Date': 'Sat, 04 Apr 2020 11:13:06 GMT', 
'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 
'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 
'P3P': "CP='NOI CURa ADMa DEVa TAIa OUR DELa BUS IND PHY ONL UNI COM NAV INT DEM PRE'", 
'X-Powered-By': 'PHP/5.5.17p1', 'Content-Encoding': 'gzip'}

입력


dir(html)

출력


['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',

	...

 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']

입력


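# ISO-8859-1 is an ASCII-based extended encoding; requests falls back to it for text responses whose Content-Type header names no charset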
html.encoding

출력


'ISO-8859-1'

입력


html.encoding = 'utf-8' # decode as UTF-8 so that the Korean text displays correctly


html.text

출력


'<!DOCTYPE html>\n<html lang="en">\n\n<head>\n  
<meta charset="UTF-8">\n  <meta name="viewport" 
content="width=device-width, initial-scale=1.0">\n  
<meta http-equiv="X-UA-Compatible" content="ie=edge">\n  
<title>Document</title>\n  <link rel="stylesheet" href=
"https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css">
\n  <style>\n    h1{\n      margin: 2rem;\n    }\n    
h1>span{\n      font-size: 1rem;\n    }\n    .main {\n      width: 70%;\n      
margin: 3rem auto auto auto;\n      text-align: center\n    }\n\n   
table {\n      width: 100%;\n    }\n  </style>\n</head>\n\n<body>\n  
<h1>크롤링 연습용 페이지

         ...

<td class="num"><span>139,085</span></td>\n        
</tr>\n      
</tbody>\n    
</table>\n  </div>\n</body>\n\n</html>\n'
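Why the first output was garbled can be reproduced without any network access. The sketch below takes the page's heading text, encodes it as UTF-8 (what the server actually sends), and then decodes the same bytes with the wrong and the right codec.

```python
# The heading text of the practice page, as it appears after the fix.
original = "크롤링 연습용 페이지"

# What travels over the network: UTF-8 encoded bytes.
raw_bytes = original.encode("utf-8")

# Decoding those bytes as ISO-8859-1 maps every byte to one character,
# which produces the garbled text seen in the first output.
garbled = raw_bytes.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 recovers the Korean text.
decoded = raw_bytes.decode("utf-8")

# The garbling is reversible because ISO-8859-1 maps all 256 byte values.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(decoded)
```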

입력


html.status_code

출력


200 # 200: the request succeeded

입력


html.ok

출력


True

Passing parameters with a GET request

입력


<html>
<head>
</head>
<body>
  <form action="test.html" method="GET">
    <input type="text" name="user_id">
    <input type="password" name="user_pw">
    <input type="submit" name="submit">
  </form>
</body>
</html>


params = {'pa1': 'val1', 'pa2': 'value2'} 
response = requests.get('http://www.paullab.co.kr', params=params)


response.url

출력


'http://www.paullab.co.kr/?pa1=val1&pa2=value2'
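The query string in a GET URL can also be built by hand, which shows what happens to the parameters. This sketch uses only the standard library; the Korean search term in the second part is an assumption added to show percent-encoding.

```python
from urllib.parse import urlencode, parse_qs

params = {'pa1': 'val1', 'pa2': 'value2'}
base_url = 'http://www.paullab.co.kr/'

# GET parameters are appended to the URL after a question mark.
full_url = base_url + '?' + urlencode(params)
print(full_url)

# Characters that are not URL-safe are percent-encoded (spaces become +).
encoded = urlencode({'q': '제주 코딩'})
print(encoded)

# Decoding the query string gives the original value back.
decoded = parse_qs(encoded)
print(decoded)
```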

Passing data with a POST request


import requests, json

data = {'pa1': 'val1', 'pa2': 'value2'} 
# json.dumps() turns the dict into a JSON string, which is sent as the request body
response = requests.post('http://www.paullab.co.kr', data=json.dumps(data))
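A POST body can carry the same data in more than one format, and the server has to be told which one through the Content-Type header. The sketch below builds the two most common bodies with the standard library only. In requests, passing a dict to data= produces the form-encoded body, a string passed to data= is sent unchanged, and the json= parameter serializes the dict and sets the JSON Content-Type.

```python
import json
from urllib.parse import urlencode

data = {'pa1': 'val1', 'pa2': 'value2'}

# Form encoding, sent with Content-Type: application/x-www-form-urlencoded
form_body = urlencode(data)

# JSON encoding, sent with Content-Type: application/json
json_body = json.dumps(data)

print(form_body)
print(json_body)
```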

Adding headers and cookies


headers = {'Content-Type': 'application/json; charset=utf-8'} 
cookies = {'session_id': 'sorryidontcare'} 
response = requests.get('http://www.paullab.co.kr', headers=headers, cookies=cookies)

Adding authentication


# a (user, password) tuple makes requests send HTTP Basic authentication
response = requests.post('http://www.paullab.co.kr', auth=("id","pass"))
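HTTP Basic authentication adds an Authorization header that holds the Base64 form of "user:password". The following sketch builds that header by hand to show what is sent; basic_auth_header is a hypothetical helper name, and UTF-8 is an assumed text encoding for the credentials.

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of the Authorization header for HTTP Basic auth."""
    credentials = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


header = basic_auth_header("id", "pass")
print(header)

# Base64 is an encoding, not encryption: anyone can reverse it.
print(base64.b64decode(header.split(" ")[1]))
```

Because the credentials are only encoded, Basic authentication should be used over HTTPS.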

Crawling with requests

입력


import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.paullab.co.kr/stock.html')
response.encoding = 'utf-8'
html = response.text

soup = BeautifulSoup(html, 'html.parser') # parse the HTML string into a searchable tree


print(soup.prettify()) # print the document as indented HTML

출력


<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="ie=edge" http-equiv="X-UA-Compatible"/>
  <title>
   Document
  </title>

			...

<td class="num">
       <span>
        139,085
       </span>
      </td>
     </tr>
    </tbody>
   </table>
  </div>
 </body>
</html>

Saving the source of a page to a file

입력


import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.paullab.co.kr/stock.html')
response.encoding = 'utf-8'
html = response.text
# save the page source to a file; the with block closes the file automatically
with open('test.html', 'w', encoding='utf-8') as f:
    f.write(html)


!dir

출력


C 드라이브의 볼륨에는 이름이 없습니다.
 볼륨 일련 번호: CC5E-6766

 C:\Users\leehojun\Google 드라이브\11_1. 콘텐츠 동영상 결과물\007. 크롤링 강의 디렉터리

2020-04-03  13:55    <DIR>          .
2020-04-03  13:55    <DIR>          ..
2020-04-03  13:10    <DIR>          .ipynb_checkpoints
2020-04-03  13:50           392,338 001.ipynb
2020-04-03  13:55            48,527 test.html # test.html has been created
2020-04-03  01:58    <DIR>          참고자료
2020-04-03  13:54           221,527 최종강의자료_크롤링.ipynb
               3개 파일             662,392 바이트
               4개 디렉터리  13,301,800,960 바이트 남음

입력


# look for a specific word in the page source
s = html.split(' ') # split on spaces; list.count() matches whole items only, so a word without a space on both sides is not found
word = input('페이지에서 검색할 단어를 입력하세요 : ')
s.count(word)

출력


페이지에서 검색할 단어를 입력하세요 : 제주
0
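The search above returns 0 because list.count() compares whole items, and no item of the split list is exactly the word. Counting substrings with str.count() avoids that. The sketch below compares both approaches on one line taken from the page's markup.

```python
# One h2 line from the practice page, used as sample input.
sample_html = '<h2 id="제주코딩베이스캠프연구원">제주코딩베이스캠프 연구원</h2>'
word = '제주'

# Whole-item match on space-separated pieces: finds nothing here.
token_count = sample_html.split(' ').count(word)

# Substring match: counts every occurrence, including those inside longer words.
substring_count = sample_html.count(word)

print(token_count, substring_count)
```

Note that the substring count also includes matches inside tags and attribute values, such as the id here.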

BeautifulSoup

A library that parses HTML held in a str into a data structure that follows the HTML tree. The second argument selects the parser:

BeautifulSoup(markup, "html.parser")

BeautifulSoup(markup, "lxml")

BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")

BeautifulSoup(markup, "html5lib")

입력


import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.paullab.co.kr/stock.html')

response.encoding = 'utf-8'
html = response.text

soup = BeautifulSoup(html, 'html.parser')


soup.title # the title tag

출력


<title>Document</title>

입력


soup.title.string # only the string inside the title tag

출력


'Document'

입력


soup.title.text # gives the same result as .string here

출력


'Document'

입력


soup.title.parent.name # name of the parent tag

출력


'head'

입력


soup.tr # table row

출력


<tr>
<th scope="col">날짜</th>
<th scope="col">종가</th>
<th scope="col">전일비</th>
<th scope="col">시가</th>
<th scope="col">고가</th>
<th scope="col">저가</th>
<th scope="col">거래량</th>
</tr>

입력


soup.td # table data

출력


<td align="center "><span class="date">2019.10.23</span></td>

입력


soup.th # table header cell

출력


<th scope="col">날짜</th>

입력


soup.table

출력


<table class="table table-hover">
<tbody>
<tr>
<th scope="col">날짜</th>
<th scope="col">종가</th>
<th scope="col">전일비</th>
<th scope="col">시가</th>
<th scope="col">고가</th>
<th scope="col">저가</th>
<th scope="col">거래량</th>
</tr>
<tr>
<td align="center "><span class="date">2019.10.23</span></td>
<td class="num"><span>6,650</span></td>

									...

</td>
<td class="num"><span>5,300</span></td>
<td class="num"><span>5,370</span></td>
<td class="num"><span>5,280</span></td>
<td class="num"><span>211,019</span></td>
</tr>
</tbody>
</table>

입력


soup.find('title') # find(): returns the first tag that matches the condition

출력


<title>Document</title>

입력


soup.find('tr')

출력


<tr>
<th scope="col">날짜</th>
<th scope="col">종가</th>
<th scope="col">전일비</th>
<th scope="col">시가</th>
<th scope="col">고가</th>
<th scope="col">저가</th>
<th scope="col">거래량</th>
</tr>

입력


soup.find('th')

출력


<th scope="col">날짜</th>

입력


soup.find(id='update').text # text of the element with a given id

출력


'update : 20.12.30'

입력


soup.find('head').find('title') # the title tag inside head

출력


<title>Document</title>

입력


soup.find('h2', id='제주코딩베이스캠프연구원')
# the h2 tag whose id is '제주코딩베이스캠프연구원'

출력


<h2 id="제주코딩베이스캠프연구원">제주코딩베이스캠프 연구원</h2>

입력


soup.find_all('h2') # find_all(): returns a list of every tag that matches the condition

출력


[<h2 id="제주코딩베이스캠프연구원">제주코딩베이스캠프 연구원</h2>,
 <h2 id="제주코딩베이스캠프공업">제주코딩베이스캠프 공업</h2>,
 <h2 id="제주코딩베이스캠프출판사">제주코딩베이스캠프 출판사</h2>,
 <h2 id="제주코딩베이스캠프학원">제주코딩베이스캠프 학원</h2>]

입력


soup.find_all('h2')[0]

출력


<h2 id="제주코딩베이스캠프연구원">제주코딩베이스캠프 연구원</h2>

입력


soup.find_all('table', class_='table') # class is a reserved word in Python, so bs4 uses class_
# reserved word: a word set aside in advance for a specific language feature

출력


[<table class="table table-hover">
 <tbody>
 <tr>
 <th scope="col">날짜</th>
 <th scope="col">종가</th>
 <th scope="col">전일비</th>
 <th scope="col">시가</th>
 <th scope="col">고가</th>
 <th scope="col">저가</th>
 <th scope="col">거래량</th>
 </tr>
 <tr>
 <td align="center "><span class="date">2019.10.23</span></td>
 <td class="num"><span>6,650</span></td>

				     ...

</td>
 <td class="num"><span>2,020</span></td>
 <td class="num"><span>2,090</span></td>
 <td class="num"><span>2,020</span></td>
 <td class="num"><span>139,085</span></td>
 </tr>
 </tbody>
 </table>]

입력


soup = BeautifulSoup('''
<hojun id='jeju' class='codingBaseCamp codingLevelUp'>
   hello world
</hojun>
''', 'html.parser')
# tag = hojun , id = 'jeju' , class = 'codingBaseCamp codingLevelUp'
tag = soup.hojun
tag

출력


<hojun class="codingBaseCamp codingLevelUp" id="jeju">
   hello world
</hojun>

입력


type(tag)

출력


bs4.element.Tag

입력


dir(tag) # attributes and methods of tag

출력


['HTML_FORMATTERS',
 'XML_FORMATTERS',
 '__bool__',
 '__call__',
 '__class__',
 '__contains__',
 '__copy__',
 '__delattr__',

   ...

 'setup',
 'string',
 'strings',
 'stripped_strings',
 'text',
 'unwrap',
 'wrap']

입력


tag.name

출력


'hojun'

입력


tag['class']

출력


['codingBaseCamp', 'codingLevelUp']

입력


tag['id']

출력


'jeju'

입력


tag.attrs # all attributes at once, as a dict

출력


{'id': 'jeju', 'class': ['codingBaseCamp', 'codingLevelUp']}

입력


tag.string # the string inside the tag

출력


'\n   hello world\n'

입력


tag.text # all text inside the tag

출력


'\n   hello world\n'

입력


tag.contents # the children as a list

출력


['\n   hello world\n']

입력


for i in tag.children: # children: iterator over the direct child nodes
    print(i)

출력


hello world

입력


tag.children

출력


<list_iterator at 0x1cd52f37550>

입력


soup = BeautifulSoup('''
<ul>
    <li id='jeju' class='codingBaseCamp codingLevelUp'>hello world</li>
    <li id='jeju' class='codingBaseCamp codingLevelUp'>hello world</li>
    <li id='jeju' class='codingBaseCamp codingLevelUp'>hello world</li>
</ul>
''', 'html.parser')
tag = soup.ul
tag

출력


<ul>
<li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>
<li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>
<li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>
</ul>

입력


tag.contents # list

출력


['\n',
 <li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>,
 '\n',
 <li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>,
 '\n',
 <li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>,
 '\n']

입력


tag.li # the first li tag

출력


<li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>

입력


tag.li.parent # the parent tag of the li tag

출력


<ul>
<li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>
<li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>
<li class="codingBaseCamp codingLevelUp" id="jeju">hello world</li>
</ul>

Selector

CSS selectors allow more fine-grained access to tags.

Use '.' to refer to a class and '#' to refer to an id.

Use '>' when the tag you are looking for is a direct child of a specific tag.

입력


import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.paullab.co.kr/stock.html')

response.encoding = 'utf-8'
html = response.text

soup = BeautifulSoup(html, 'html.parser')


soup.select('#update')

출력


[<span id="update">update : 20.12.30</span>]

입력


soup.select('.table > tr') # tr tags that are direct children of an element with class 'table'
# the actual nesting is table > tbody > tr, and '>' matches direct children only, so nothing is found

출력

[]
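The empty result comes from the child combinator '>'. A space between selectors is the descendant combinator, which matches at any depth. The sketch below compares the three forms on a small hand-written table (an assumed sample, not the real page) so that it runs without network access.

```python
from bs4 import BeautifulSoup

# A small hand-written table that mirrors the nesting of the practice page.
SAMPLE = """
<table class="table">
  <tbody>
    <tr><td>2019.10.23</td></tr>
    <tr><td>2019.10.22</td></tr>
  </tbody>
</table>
"""

soup_sample = BeautifulSoup(SAMPLE, "html.parser")

# Child combinator: tr must be a direct child of .table, so nothing matches.
direct = soup_sample.select(".table > tr")

# Spelling out every level of the nesting works.
nested = soup_sample.select(".table > tbody > tr")

# Descendant combinator (a space): tr at any depth below .table.
descendant = soup_sample.select(".table tr")

print(len(direct), len(nested), len(descendant))
```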

입력


soup.select('.table > tbody > tr')[2] # the third tr inside tbody inside an element with class 'table'

출력


<tr>
<td align="center"><span class="date">2019.10.22</span></td>
<td class="num"><span>6,630</span></td>
<td class="num">
<img alt="하락" height="6" src="ico_down.gif" style="margin-right:4px;" width="7"/>
<span class="tah p11 nv01">
                            190
                        </span>
</td>
<td class="num"><span>6,830</span></td>
<td class="num"><span>6,930</span></td>
<td class="num"><span>6,530</span></td>
<td class="num"><span>919,571</span></td>
</tr>

입력


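# ways to select elements
soup.select("p > a:nth-of-type(2)") # the second a tag among the a children of a p tag
soup.select("p > a:nth-child(even)") # a tags that are even-numbered children of a p tag (use odd for odd-numbered)
soup.select('a[href]') # a tags that have an href attribute
soup.select("#link1 + .sister") # the element with class sister that comes right after the element with id link1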


oneStep = soup.select('.main')[0] # 제주코딩베이스캠프 연구원
oneStep

출력


<div class="main">
<h2 id="제주코딩베이스캠프연구원">제주코딩베이스캠프 연구원</h2>
<h3><span style="color: salmon">일별</span> 시세</h3>
<table class="table table-hover">
<tbody>
<tr>
<th scope="col">날짜</th>
<th scope="col">종가</th>
<th scope="col">전일비</th>
<th scope="col">시가</th>
<th scope="col">고가</th>
<th scope="col">저가</th>
<th scope="col">거래량</th>
</tr>

        ...

<td class="num"><span>5,300</span></td>
<td class="num"><span>5,370</span></td>
<td class="num"><span>5,280</span></td>
<td class="num"><span>211,019</span></td>
</tr>
</tbody>
</table>
</div>

입력


twoStep = oneStep.select('tbody > tr')[1:] 
twoStep

출력


[<tr>
 <td align="center "><span class="date">2019.10.23</span></td>
 <td class="num"><span>6,650</span></td>
 <td class="num">
 <img alt="상승 " height="6 " src="ico_up.gif " style="margin-right:4px; " width="7 "/>
 <span>
                             20
                         </span>

														...

														 10
                         </span>
 </td>
 <td class="num"><span>5,300</span></td>
 <td class="num"><span>5,370</span></td>
 <td class="num"><span>5,280</span></td>
 <td class="num"><span>211,019</span></td>
 </tr>]

입력


twoStep[0].select('td')[0].text # date

출력


'2019.10.23'

입력


twoStep[0].select('td')[1].text  # closing price

출력


'6,650'   # this is a str, so convert it to a number before using it in calculations
# int(twoStep[0].select('td')[1].text.replace(',', ''))

입력


날짜 = []
종가 = []

for i in twoStep:
    날짜.append(i.select('td')[0].text)
    종가.append(int(i.select('td')[1].text.replace(',', '')))


날짜

출력


['2019.10.23',
 '2019.10.22',
 '2019.10.21',
 '2019.10.18',
 '2019.10.17',
 '2019.10.16',
 '2019.10.15',
 '2019.10.14',
 '2019.10.11',
 '2019.10.10',
 '2019.10.08',
 '2019.10.07',
 '2019.10.04',
 '2019.10.02',
 '2019.10.01',
 '2019.09.30',
 '2019.09.27',
 '2019.09.26',
 '2019.09.25',
 '2019.09.24']

입력


종가

출력

입력


# visualization
# closing price trend by date
import plotly.express as px

fig = px.line(x=날짜, y=종가, title='jejucodingcamp')
fig.show()

출력

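Exercises

Installing the libraries (assuming most are already installed)

!pip3 install requests

!pip3 install beautifulsoup4

Crawling URL : http://www.paullab.co.kr/stock.html

Problem 1

Assuming that each company has 10,000 shares, compute the total market capitalization of the whole group.

Group companies: [ 제주코딩베이스캠프 연구원, 제주코딩베이스캠프 공업, 제주코딩베이스캠프 출판사, 제주코딩베이스캠프 학원]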

입력


import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.paullab.co.kr/stock.html")

response.encoding = 'utf-8'
html = response.text

soup = BeautifulSoup(html, 'html.parser')


soup.select('.main')[0] # 제주코딩베이스캠프 연구원
soup.select('.main')[1] # 제주코딩베이스캠프 공업
soup.select('.main')[2] # 제주코딩베이스캠프 출판사
soup.select('.main')[3] # 제주코딩베이스캠프 학원

출력


<div class="main">
<h2 id="제주코딩베이스캠프학원">제주코딩베이스캠프 학원</h2>
<h3><span style="color: salmon">일별</span> 시세</h3>
<table class="table table-hover">
<tbody>
<tr>
<th scope="col">날짜</th>
<th scope="col">종가</th>
<th scope="col">전일비</th>
<th scope="col">시가</th>
<th scope="col">고가</th>
<th scope="col">저가</th>
<th scope="col">거래량</th>
</tr>

		...

<td class="num"><span>2,020</span></td>
<td class="num"><span>2,090</span></td>
<td class="num"><span>2,020</span></td>
<td class="num"><span>139,085</span></td>
</tr>
</tbody>
</table>
</div>

입력


그룹사별일일시가 = soup.select('.main')
오늘종가 = []
오늘시가총액 = []

for i in 그룹사별일일시가:
    print(i.select('.table > tbody > tr')[1].select('td')[1])
    print(i.select('.table > tbody > tr')[1].select('td')[1].text)
    print(i.select('.table > tbody > tr')[1].select('td')[1].text.replace(',', ''))

출력


<td class="num"><span>6,650</span></td> # Oct 23, 제주코딩베이스캠프 연구원, closing price
6,650
6650
<td class="num"><span>31,300</span></td> # Oct 23, 제주코딩베이스캠프 공업, closing price
31,300
31300
<td class="num"><span>13,250</span></td> # Oct 23, 제주코딩베이스캠프 출판사, closing price
13,250
13250
<td class="num"><span>2,600</span></td> # Oct 23, 제주코딩베이스캠프 학원, closing price
2,600
2600

입력


그룹사별일일시가 = soup.select('.main')
오늘종가 = []
오늘시가총액 = []

for i in 그룹사별일일시가:
    오늘종가.append(int(i.select('.table > tbody > tr')[1].select('td')[1].
										select('td > span')[0].text.replace(',', '')))
print(오늘종가)

출력


[6650, 31300, 13250, 2600]

입력


오늘시가총액 = [i*10000 for i in 오늘종가]
전그룹사시가총액 = format(sum(오늘시가총액), ',')
전그룹사시가총액

출력


'538,000,000'

Problem 2

Plot the trend of the whole group's market capitalization. The x-axis is the date and the y-axis is the price.

입력


# compute the whole group's market capitalization for each day
그룹사별일일시가 = soup.select('.main')
오늘시가총액 = []
# row 0 of each table is the header row, so the data rows start at index 1
for j in range(1, len(그룹사별일일시가[0].select('table > tbody > tr'))):
    오늘종가 = []
    for i in 그룹사별일일시가:
        오늘종가.append(int(i.select('.table > tbody > tr')[j].select('td')[1].
                            select('td > span')[0].text.replace(',', '')))
    # 10,000 shares per company, as assumed in Problem 1
    오늘시가총액.append(sum(오늘종가) * 10000)


오늘시가총액

출력

입력


# crawl the date column of the table
날짜전체 = soup.select('.main')[0].select('.table > tbody > tr > td > .date')
date = []
for i in 날짜전체:
    date.append(i.text)
date

출력


['2019.10.23',
 '2019.10.22',
 '2019.10.21',
 '2019.10.18',
 '2019.10.17',
 '2019.10.16',
 '2019.10.15',
 '2019.10.14',
 '2019.10.11',
 '2019.10.10',
 '2019.10.08',
 '2019.10.07',
 '2019.10.04',
 '2019.10.02',
 '2019.10.01',
 '2019.09.30',
 '2019.09.27',
 '2019.09.26',
 '2019.09.25',
 '2019.09.24']

입력


%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(date, 오늘시가총액)
plt.xticks(rotation = -45 ) # rotate the x-axis tick labels
plt.show()

출력

입력


%matplotlib inline
# reverse the lists so that the dates run in chronological order, then plot again
import matplotlib.pyplot as plt

plt.plot(date[::-1], 오늘시가총액[::-1]) 
plt.xticks(rotation = -45 )
plt.show()

출력

입력


import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.paullab.co.kr/stock.html")

response.encoding = 'utf-8'
html = response.text

soup = BeautifulSoup(html, 'html.parser')

출력


[538000000,
 531800000,
 536150000,
 523050000,
 490350000,
 487550000,
 469700000,
 461400000,
 459000000,
 457650000,
 440000000,
 432100000,
 438300000,
 443100000,
 448500000,
 443700000,
 439350000,
 441800000,
 444100000,
 462450000]

입력


# crawl the date column of the table
날짜 = soup.select('.main')[0].select('.table > tbody > tr > td > .date')
date = []
for i in 날짜:
    date.append(i.text)
date

출력


['2019.10.23',
 '2019.10.22',
 '2019.10.21',
 '2019.10.18',
 '2019.10.17',
 '2019.10.16',
 '2019.10.15',
 '2019.10.14',
 '2019.10.11',
 '2019.10.10',
 '2019.10.08',
 '2019.10.07',
 '2019.10.04',
 '2019.10.02',
 '2019.10.01',
 '2019.09.30',
 '2019.09.27',
 '2019.09.26',
 '2019.09.25',
 '2019.09.24']

입력


# visualization
%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(date[::-1], 오늘시가총액[::-1])
plt.xticks(rotation = -45)
plt.show()

출력

Problem 3

Plot each company's trading volume and the whole group's trading volume as subplots.

입력


그룹사별일일데이터 = soup.select('.main')
그룹사별일일거래량 = [[],[],[],[]]
그룹사전체일일거래량 = []
# data structure (same order as the .main blocks on the page):
# 그룹사별일일거래량 = [[연구원], [공업], [출판사], [학원]]


그룹사별일일데이터[0].select('.table > tbody > tr')[0]
그룹사별일일데이터[0].select('.table > tbody > tr')[1].select('td')[-1].text.replace(',','')

출력


'398421'

입력


for j in range(1, len(그룹사별일일데이터[0].select('table > tbody > tr'))):
    # the last td of each data row holds that day's trading volume
    for k in range(4):
        그룹사별일일거래량[k].append(int(그룹사별일일데이터[k].select('.table > tbody > tr')[j].
                            select('td')[-1].text.replace(',', '')))


그룹사별일일거래량[0] # 연구원
그룹사별일일거래량[1] # 공업
그룹사별일일거래량[2] # 출판사
그룹사별일일거래량[3] # 학원
len(그룹사별일일거래량[0])
그룹사별일일거래량[0]

출력

입력


%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(date[::-1], 그룹사별일일거래량[0][::-1], label='A')
plt.plot(date[::-1], 그룹사별일일거래량[1][::-1], label='B')
plt.plot(date[::-1], 그룹사별일일거래량[2][::-1], label='C')
plt.plot(date[::-1], 그룹사별일일거래량[3][::-1], label='D')
plt.xticks(rotation = -45 )
plt.legend(loc=2)
plt.show()

출력

입력


# add up the four companies' volumes day by day
그룹사전체일일거래량 = [sum(day) for day in zip(*그룹사별일일거래량)]
그룹사전체일일거래량

출력

입력


%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(date[::-1], 그룹사전체일일거래량[::-1], label='ALL')
plt.xticks(rotation = -45 )
plt.legend(loc=2)
plt.show()

출력

입력


f = plt.figure(figsize=(10,3))
# plot 1 (per company)
ax = f.add_subplot(121)
ax.plot(date[::-1], 그룹사별일일거래량[0][::-1], label='A')
ax.plot(date[::-1], 그룹사별일일거래량[1][::-1], label='B')
ax.plot(date[::-1], 그룹사별일일거래량[2][::-1], label='C')
ax.plot(date[::-1], 그룹사별일일거래량[3][::-1], label='D')
plt.xticks(rotation = -45)
ax.legend(loc=2)
# plot 2 (whole group); the figure size is already set by plt.figure(figsize=...) above
ax2 = f.add_subplot(122)
ax2.plot(date[::-1], 그룹사전체일일거래량[::-1], label='ALL')
plt.xticks(rotation = -45)
ax2.legend(loc=2)

출력
