esprehn 3 years ago

This isn't really JS, it's a purpose built evaluator that's only for evaluating a particular script on YouTube, assuming a huge list of things are true about how YouTube JS is written.

Ex. It's got a hard-coded list of methods for String, and it doesn't respect prototypes. It only supports creating Date instances, and won't work if you override the global Date. It parses with regexes and implements all operators with Python's operator module (which has the wrong type semantics), etc. Nearly none of the semantics of JS are implemented.
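
To illustrate the type-semantics point, here is a minimal sketch (not code from youtube-dl; `try_op` is a hypothetical helper) of how Python's operator module diverges from what JavaScript would do with the same operands:

```python
import operator

def try_op(fn, a, b):
    """Apply a Python operator and return the result, or the exception name on failure."""
    try:
        return fn(a, b)
    except Exception as exc:  # broad on purpose: we only want to show the outcome
        return type(exc).__name__

# JavaScript: "1" + 2 === "12" (the number is coerced to a string)
print(try_op(operator.add, "1", 2))      # TypeError

# JavaScript: "3" * "4" === 12 (both strings are coerced to numbers)
print(try_op(operator.mul, "3", "4"))    # TypeError

# JavaScript: 1 / 0 === Infinity
print(try_op(operator.truediv, 1, 0))    # ZeroDivisionError

# JavaScript: [] + [] === "" (arrays are coerced to strings)
print(try_op(operator.add, [], []))      # []
```

An evaluator built on these operators only agrees with JS when the script happens to stay on operand types where the two languages behave the same.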

It's sort of the sandwich categorization problem:

If I write a C# "interpreter" in Perl that's only 200 lines and just handles string.Join, string.Concat and Console.WriteLine, and it doesn't actually try to implement C# syntax or semantics at all and just uses Perl semantics for those operations, is it actually C#? :P

I say "not a sandwich".

jraph 3 years ago

And as a user of youtube-dl, I'm quite happy about this. This probably allows a very safe, restricted "subset" of JS. Way better than using a full JS engine. 900 lines is still small and manageable.

  • sebzim4500 3 years ago

    I'm trying to get the threat model here. Is the concern that YouTube will inject JS into the payload which tries to break out of the youtube-dl JS sandbox using some zero day in whatever JS engine they would use instead?

    • kevingadd 3 years ago

      Embedding a whole JS engine and then interopping with it from Python would be non-trivial. Good luck fixing any bugs or corner cases you hit that way. The V8 and SpiderMonkey embedding APIs are both C++ (iirc) and non-trivial to use correctly.

      Having full control like this plus simple code is probably lower risk and more maintainable, even if there's the challenge of expanding the feature set if scripts change.

      The alternative would be a console JS shell, but those are very different from browsers, so that poses its own challenges.

      • em-bee 3 years ago

        apparently yt-dlp is somehow calling out to a js engine if available

        • kevingadd 3 years ago

          Yeah, it's possible to install v8 or spidermonkey shells and use them to run code - we use them to run parts of the .NET wasm test suite - but they have a bunch of arbitrary limitations, so if you're trying to emulate a browser I'm not sure I'd bet on them. It's certainly going to be easier than a C++ embedding, so it makes sense that they took that route.

          Another option is to use node, but it also has weird limitations/behaviors when running code.

      • lloeki 3 years ago

        > Embedding a whole js engine and then interopping with it from python would be non trivial.

        Cue libv8-node+mini_racer from which PyMiniRacer was born. It is non-trivial but not as hard as one might think.

        The most painful part is the libv8 build system and Google-centric tooling (depot tools!), which makes it an absolute PITA for libv8 consumers that are not Google/Chrome.

        This is why the libv8 gem was atrocious to keep up to date and to build for several platforms, and why libv8-node was born, because the node build system and source distribution are actually sane (props to their relentless work, on which we piggyback).

        Disclaimer: worked at Sqreen, now maintainer of libv8-node and collaborator of mini_racer

        https://github.com/sqreen/PyMiniRacer

        https://github.com/rubyjs/mini_racer

        https://github.com/rubyjs/libv8-node

        • kevingadd 3 years ago

          Very cool, I'll have to remember that this exists! Looks useful.

    • jraph 3 years ago

      Let's say they end up using Node. Node has a quite complete standard library that lets you access files and everything.

      Now if they do it right and only embed some bare JS interpreter, it's still way harder to audit than these < 900 lines, for which it is quite easy to convince oneself that the interpreted script cannot do much.

      • geysersam 3 years ago

        Nowadays they could probably use Deno. Without permissions it doesn't allow network or file access etc.

    • rwmj 3 years ago

      Google attempting zero days on client computers would be something. It's not totally without precedent (Sony CD rootkits - https://en.wikipedia.org/wiki/Sony_BMG_copy_protection_rootk...) but would still be major news.

      • btown 3 years ago

        While they likely wouldn't do a zero-day, their JS files, particularly for automated captchas, do push the boundaries of whatever JS engine they're executed inside. See https://github.com/neuroradiology/InsideReCaptcha#the-analys... and note that this analysis is 8 years old. While there's minimal risk if you're either using a full-fledged modern JS engine or a limited-subset interpreter like the OP, an older or non-optimized spec-compliant JS engine might hit pathological performance cases and result in you DOSing yourself.

        • saagarjha 3 years ago

          I mean the DOS would be that your youtube-dl invocation hangs, and then you kill it.

        • origin_path 3 years ago

          It's interesting to speculate about why they don't use this much more powerful technology to stop ytdl but instead use this much weaker yt specific thing.

          Most likely the reason is that they keep the botguard system for the stuff that matters to them a lot more like account signups and click fraud, and don't want to incentivize the ytdl guys to break it on behalf of spammers/clickfraudsters.

    • loeg 3 years ago

      youtube-dl targets a lot of websites other than Google properties, many of which are a lot sketchier (think, uh, NSFW streaming sites).

    • pabs3 3 years ago

      One of the reasons people use yt-dlp/youtube-dl (and nitter.net/etc) is to transform the modern proprietary JavaScript web into something more suitable for enthusiasts of the old document web and of FOSS. If the web switched to plain <video> then yt-dlp/youtube-dl would become completely unnecessary. Your browser should not have to run JS to watch an embedded video.

      • nyanpasu64 3 years ago

        On my Ivy Bridge laptop running Linux, enabling hardware video decode in mpv took installing one package and adding one line to mpv.conf. Enabling hardware decoding in Firefox took multiple attempts of Googling frantically, toggling flags in about:config, passing logging environment variables to Firefox, recording a Pernosco trace of multi-process communication, and even asking for help in the gfx-firefox Matrix chat where they pointed out I had disabled media.rdd-process.enabled causing Firefox to print a misleading error message in about:support saying HARDWARE_VIDEO_DECODING was available, but failing at runtime saying WebRender was disabled even though it was enabled. And to my knowledge, hardware decoding in Chromium is simply not possible on Linux right now (maybe possible on Chromebooks, I haven't checked).

        Even after I fixed hardware acceleration, playing a 1080p YouTube video in Firefox using hardware H.264 decoding took more CPU energy (40% of a core) than playing the same video in mpv using software H.264 decoding (20% of a core). Web browsers are just horrifically complex, intractable to understand, and inefficient.

  • mjevans 3 years ago

    yt-dlp sometimes doesn't know how to evaluate the JavaScript / ECMAScript and will call out to an optional dependency, a real JavaScript interpreter, if installed.

  • jiggawatts 3 years ago

    That’s the exact same logic I hear from developers who say things like:

    Why do I need a full XML parser when I can just extract what I need with regex?

    And:

    All that RPC IDL stuff is overcomplicated, REST is so much easier because I can just write the client by hand.

dang 3 years ago

Ok, we've changed this title to shrink the scope of the interpreter.

Submitted title was "YouTube-dl has a JavaScript interpreter written in 870 lines of Python".

  • ec109685 3 years ago

    Hence why HN is better than Twitter.

    The number of high-engagement, just plain wrong tweets there is just sad.

tra3 3 years ago

It quacks like a duck at midnight, but it’s actually a frog?

blast 3 years ago

I suppose this means it would be easy for YouTube to fuck with youtube-dl simply by throwing in more features of JS?
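
One way to picture this is a toy sketch of a regex-parsed evaluator with a hard-coded method list. It is not youtube-dl's actual code, and `SUPPORTED`, `CALL_RE` and `run` are made-up names. It runs the calls it was written for and gives up on anything else:

```python
import re

# Hypothetical allowlist: the only "methods" this toy evaluator understands.
SUPPORTED = {
    "reverse": lambda s, _n: s[::-1],
    "slice": lambda s, n: s[n:],
}

# Matches statements of the form name() or name(123), and nothing else.
CALL_RE = re.compile(r"^\s*(\w+)\((\d*)\)\s*$")

def run(script: str, value: str) -> str:
    """Apply semicolon-separated calls like 'reverse();slice(2)' to value."""
    for stmt in filter(None, (part.strip() for part in script.split(";"))):
        match = CALL_RE.match(stmt)
        if match is None or match.group(1) not in SUPPORTED:
            raise NotImplementedError(f"unsupported statement: {stmt!r}")
        name, arg = match.groups()
        value = SUPPORTED[name](value, int(arg or 0))
    return value

print(run("reverse();slice(2)", "abcdef"))  # dcba

try:
    run("reverse();value = value.at(-1)", "abcdef")
except NotImplementedError as exc:
    print(exc)  # the second statement is valid JS, but not in the supported set
```

Under these assumptions, any statement outside the expected shapes stops the evaluator until someone extends it, which is the maintenance cost of this approach.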

Test0129 3 years ago

This really isn't fair. Just because it doesn't faithfully implement whatever standard JavaScript is on doesn't mean it isn't an interpreter. All an interpreter is is something that executes a script directly rather than requiring compilation. It is a de facto interpreter for a subset of JavaScript. Nothing more, nothing less. The title could be more clear, however.

  • baobabKoodaa 3 years ago

    There's a huge difference between an interpreter for "JavaScript" and an interpreter for a "subset of JavaScript".

    • Test0129 3 years ago

      Making a pedantic argument on what constitutes an interpreter is silly. The title is bad. It is an interpreter. I'll continue to eat downvotes on this because of the pedantry of HN.

      • khazhoux 3 years ago

        Technically, it’s only the pedantry of a subset of HN.

        • lupire 3 years ago

          It's an interpretation of a subset of the pedantry on HN.

      • jraph 3 years ago

        I didn't downvote, but I don't think esprehn is being unfair. Their comment is very informative. They didn't argue that what was implemented is not an interpreter, they did explain why it's not a JavaScript interpreter and not even an interpreter for a subset of JavaScript. It's just a special purpose interpreter suitable for YouTube's code that cannot be re-used for any code that uses the subset that it seems to implement.

        It's not pedantry (or I'm pedantic). It's a reaction to the title that can lead people to believe that a complete JavaScript interpreter has been written in less than a thousand lines of Python. This reaction is perfectly understandable.

      • chess_buster 3 years ago

        I evaluated it with my Pedantic Interpreter which only results in the `pedantic` token.

      • baobabKoodaa 3 years ago

        > Making a pedantic argument on what constitutes an interpreter is silly. The title is bad. It is an interpreter.

        It's not a pedantic argument. Based on the title I thought that somebody wrote something akin to V8 in 800 lines of Python. After reading the comments I realized those 800 lines just interpret a particular JavaScript function written by YouTube. Those things are different. Pointing out the fact that they are different is not pedantry. The title is misleading and the comments pointing that out are helpful.

      • blondin 3 years ago

        my vote is meaningless and i am sorry about that. but just wanted to let you know that what you said made sense. do not let people get to you.

        most of us know that a thousand or so lines of code is not a full JavaScript interpreter and cannot be the real thing.

        there is no argument or conversation to have about it.

  • blast 3 years ago

    esprehn didn't say it isn't an interpreter. They're saying it is an interpreter and what it's interpreting isn't (all of) JS. That's also what you're saying, so you're agreeing with esprehn.

    Edit: You misunderstood baobabKoodaa in the same way. Nobody is arguing about what constitutes an interpreter, except you. The question is what language is being interpreted.

    Before accusing someone of pedantry, it would first be good not to completely misread them.