Using websocket for data scraping

by wiki   Last Updated October 10, 2019 02:26 AM

I want to scrape some data from a page that delivers it over WebSockets. After inspecting the wss address, the request headers, and the negotiation message in Chrome DevTools, I wrote:

from websocket import create_connection

headers = {
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9,fa;q=0.8',
    'Cache-Control': 'no-cache',
    'Connection': 'Upgrade',
    'Host': 'stream179.forexpros.com',
    'Origin': 'https://www.investing.com',
    'Pragma': 'no-cache',
    'Sec-WebSocket-Extensions': 'client_max_window_bits',
    'Sec-WebSocket-Key': 'ldcvnZNquzPkSNvpSdI09g==',
    'Sec-WebSocket-Version': '13',
    'Upgrade': 'websocket',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
}

ws = create_connection('wss://stream179.forexpros.com/echo/894/l27e2ja8/websocket', header=headers)

nego_message = '''["{\"_event\":\"bulk-subscribe\",\"tzID\":8,\"message\":\"pid-1:%%pid-8839:%%pid-166:%%pid-20:%%pid-169:%%pid-170:%%pid-44336:%%pid-27:%%pid-172:%%pid-2:%%pid-3:%%pid-5:%%pid-7:%%pid-9:%%pid-10:%%pid-945629:%%pid-11:%%pid-16:%%pid-68:%%pidTechSumm-1:%%pidTechSumm-2:%%pidTechSumm-3:%%pidTechSumm-5:%%pidTechSumm-7:%%pidTechSumm-9:%%pidTechSumm-10:%%pidExt-1:%%event-393634:%%event-393633:%%event-393636:%%event-393638:%%event-394479:%%event-394518:%%event-394514:%%event-394516:%%event-394515:%%event-394517:%%event-393654:%%event-394467:%%event-393653:%%event-394468:%%event-394545:%%event-394549:%%event-394548:%%event-394547:%%event-394550:%%event-394546:%%event-394551:%%event-394553:%%event-394552:%%event-394743:%%event-394744:%%event-393661:%%event-394469:%%event-394470:%%event-393680:%%event-393682:%%event-393681:%%event-393687:%%event-393694:%%event-393685:%%event-393689:%%event-393688:%%event-393695:%%event-393698:%%event-393704:%%event-393705:%%event-393724:%%event-393723:%%event-393725:%%event-393726:%%event-394591:%%event-393736:%%event-393733:%%event-393734:%%event-393740:%%event-393731:%%event-393732:%%event-393730:%%event-394617:%%event-394616:%%event-393737:%%event-378304:%%event-393645:%%event-394619:%%event-393755:%%event-393757:%%event-393760:%%event-393756:%%event-393758:%%event-393759:%%event-393761:%%event-393762:%%event-394481:%%event-394625:%%event-393754:%%event-394483:%%event-393775:%%event-394621:%%event-394622:%%event-376710:%%event-394623:%%event-394484:%%event-394624:%%isOpenExch-1:%%isOpenExch-2:%%isOpenExch-13:%%isOpenExch-3:%%isOpenExch-4:%%isOpenPair-1:%%isOpenPair-8839:%%isOpenPair-44336:%%cmt-1-5-1:%%domain-1:\"}"]'''

ws.send(nego_message)

while True:
    print(ws.recv())

but I'm getting:

o

Traceback (most recent call last):
  File "test.py", line 647, in <module>
    print(ws.recv())
  File "C:\Users\me\AppData\Local\Programs\Python\Python37\lib\site-packages\websocket\_core.py", line 313, in recv
    opcode, data = self.recv_data()
  File "C:\Users\me\AppData\Local\Programs\Python\Python37\lib\site-packages\websocket\_core.py", line 330, in recv_data
    opcode, frame = self.recv_data_frame(control_frame)
  File "C:\Users\me\AppData\Local\Programs\Python\Python37\lib\site-packages\websocket\_core.py", line 343, in recv_data_frame
    frame = self.recv_frame()
  File "C:\Users\me\AppData\Local\Programs\Python\Python37\lib\site-packages\websocket\_core.py", line 377, in recv_frame
    return self.frame_buffer.recv_frame()
  File "C:\Users\me\AppData\Local\Programs\Python\Python37\lib\site-packages\websocket\_abnf.py", line 361, in recv_frame
    self.recv_header()
  File "C:\Users\me\AppData\Local\Programs\Python\Python37\lib\site-packages\websocket\_abnf.py", line 309, in recv_header
    header = self.recv_strict(2)
  File "C:\Users\me\AppData\Local\Programs\Python\Python37\lib\site-packages\websocket\_abnf.py", line 396, in recv_strict
    bytes_ = self.recv(min(16384, shortage))
  File "C:\Users\me\AppData\Local\Programs\Python\Python37\lib\site-packages\websocket\_core.py", line 452, in _recv
    return recv(self.sock, bufsize)
  File "C:\Users\me\AppData\Local\Programs\Python\Python37\lib\site-packages\websocket\_socket.py", line 115, in recv
    "Connection is already closed.")
websocket._exceptions.WebSocketConnectionClosedException: Connection is already closed.
[Finished in 1.9s]

What am I missing here?



Answers 1


The while loop calls ws.recv() a second time, after the connection has already been closed. The first call succeeds, which is why o is printed before the stack trace. If you simply do:

print(ws.recv())

it will not attempt to call .recv() on a closed connection.
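If the loop is worth keeping, one option is to catch the closed-connection exception and stop reading instead of crashing. The sketch below illustrates that pattern without touching the network: ConnectionClosed, FakeConnection and read_until_closed are hypothetical names, with ConnectionClosed standing in for websocket.WebSocketConnectionClosedException from the traceback.

```python
class ConnectionClosed(Exception):
    """Stand-in for websocket.WebSocketConnectionClosedException (assumption)."""


class FakeConnection:
    """Hypothetical test double: returns queued frames, then raises like a closed socket."""

    def __init__(self, frames):
        self._frames = list(frames)

    def recv(self):
        if not self._frames:
            raise ConnectionClosed("Connection is already closed.")
        return self._frames.pop(0)


def read_until_closed(conn, closed_exc=ConnectionClosed):
    """Collect frames from conn.recv() until the peer closes the connection."""
    frames = []
    while True:
        try:
            frames.append(conn.recv())
        except closed_exc:
            break  # the server hung up; stop instead of raising
    return frames


print(read_until_closed(FakeConnection(["o"])))  # ['o']
```

With the real library, the same function should work by passing the object returned by create_connection and closed_exc=websocket.WebSocketConnectionClosedException.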

As an aside, for a scrape you probably want a longer-running connection, for example one managed by websocket.WebSocketApp.
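A rough sketch of what a WebSocketApp-based version might look like follows. It is not a verified working scraper: the URL and Origin header are copied from the question and may no longer be valid, the channel list is shortened, and build_subscribe_frame, main and the callback names are hypothetical. The frame is built with json.dumps because in a regular (non-raw) Python string literal \" collapses to a plain ", so the literal in the question does not send the backslash-escaped text that DevTools showed.

```python
import json


def build_subscribe_frame(channels, tz_id=8):
    """Build a JSON array holding one JSON-encoded string, as shown in the question."""
    payload = {
        "_event": "bulk-subscribe",
        "tzID": tz_id,
        # Each channel is written as "name:" and entries are joined with "%%".
        "message": "%%".join(f"{name}:" for name in channels),
    }
    inner = json.dumps(payload, separators=(",", ":"))
    return json.dumps([inner])  # the outer dumps escapes the inner quotes


def on_open(ws):
    # Subscribe as soon as the socket is open (shortened channel list).
    ws.send(build_subscribe_frame(["pid-1", "pid-8839"]))


def on_message(ws, message):
    print(message)


def on_error(ws, error):
    print("error:", error)


def on_close(ws, *args):
    # *args tolerates both the older and newer on_close signatures.
    print("closed")


def main():
    import websocket  # third-party package: websocket-client

    app = websocket.WebSocketApp(
        "wss://stream179.forexpros.com/echo/894/l27e2ja8/websocket",
        header=["Origin: https://www.investing.com"],
        on_open=on_open,
        on_message=on_message,
        on_error=on_error,
        on_close=on_close,
    )
    app.run_forever()  # blocks until the connection ends


# Show the frame without connecting; call main() to actually open the socket.
print(build_subscribe_frame(["pid-1", "pid-8839"]))
```

Handshake headers such as Sec-WebSocket-Key, Upgrade and Connection are left out on purpose, since the client library generates them itself.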

Tomanow
October 10, 2019 02:24 AM
